No description
Find a file
2026-06-22 19:07:33 +02:00
dfs.S add asm routines 2026-06-22 19:07:24 +02:00
invert_binary_tree.S add asm routines 2026-06-22 19:07:24 +02:00
invert_linked_list.S add asm routines 2026-06-22 19:07:24 +02:00
kmeans.S add asm routines 2026-06-22 19:07:24 +02:00
mat4_mul.S add asm routines 2026-06-22 19:07:24 +02:00
README.md add readme file 2026-06-22 19:07:33 +02:00
simple_linear_regression.S add asm routines 2026-06-22 19:07:24 +02:00

a53 asm sketchbook

A collection of hand-written AArch64 (ARMv8-A) assembly routines targeting the ARM Cortex-A53. Some are old, some are new. I decided to publish them as they are, so think of this less as a library and more as a sketchbook of my experiments with the core.

The point is mostly pedagogical. Each file is a worked example of why the instructions are arranged the way they are on this particular core, not just what they compute.

Two themes run through the code:

  1. NEON / data-parallel kernels (simple_linear_regression.S, kmeans.S, mat4_mul.S), where loop unrolling and load ordering are tuned for the A53 pipeline.
  2. Stack / recursion / ABI examples (dfs.S, invert_binary_tree.S, invert_linked_list.S), which show how to set up frames, use a variable-length stack allocation (alloca-style), and honour the procedure-call standard.

Why the Cortex-A53 changes how you write the code

The A53 is an in-order, dual-issue superscalar core. Three properties dominate every decision in this repo.

  • In-order issue. Instructions are issued in program order. The hardware cannot look ahead and reorder around a stalled instruction the way an out-of-order core (Cortex-A72, A78, Apple cores, and so on) does. If instruction N stalls waiting on a result or a load, the pipeline stalls, and instruction N+1 does not sneak past it. So the programmer (or compiler) is responsible for scheduling: you have to manually interleave independent work so there is something useful to issue while a long-latency operation is in flight.

  • Dual-issue, but with restrictions. The A53 can issue up to two instructions per cycle, but only for certain pairings. To benefit you need a steady supply of mutually independent instructions adjacent in the stream. Long dependency chains serialize and waste half the machine.

  • A shallow memory subsystem that stalls easily. The L1 data cache can only keep a very small number of outstanding misses in flight at once (its linefill buffers, the MSHR-style slots, number around 3). Once those are occupied, the next miss cannot even be issued, and because the core is in-order the whole pipeline stalls behind it. The practical effect is that the A53 is not just slow on a single miss, it is slow on a burst of misses, because it cannot overlap many of them. This is exactly why software prefetch (below) earns its keep: it spends those scarce slots ahead of time, on addresses you know you will need, instead of discovering the misses too late.

The practical consequences, all visible in the kernels here:

Technique Why it matters on A53
Loop unrolling Amortizes loop overhead and, more importantly, exposes more independent instructions per iteration so the dual-issue slots stay full.
Multiple accumulator chains A single fmla accumulator creates a serial dependency. Each fmla waits on the previous one's result, and NEON FP latency on the A53 is about 4 cycles, so a single chain runs at roughly 1/4 of peak. Splitting into two or more independent accumulators lets the pipeline overlap them and hide that latency.
Hoisting / batching loads NEON loads have multi-cycle latency and the core can't reorder around them. Issuing all the loads for an iteration up front gives the load/store unit time to retire them before the arithmetic that consumes them is reached.
Software prefetch (prfm) The A53's hardware prefetcher is modest and, as noted above, only a handful of misses can be outstanding at once. Explicit prfm pldl1keep a few cache lines ahead spends those slots early and hides L2/DRAM latency for streaming workloads.
Avoiding read-after-write hazards on NEON <-> scalar moves Reductions are written into disjoint scratch registers so the consuming move doesn't immediately depend on the just-written value.

On an out-of-order core most of this is invisible, since the hardware recovers the parallelism for you. On the A53 it is the difference between near-peak throughput and a pipeline that idles on every dependency.


ABI / calling convention (AAPCS64) cheat-sheet

The routines follow the standard AArch64 procedure call standard. Quick reference for reading the prologues and epilogues:

  • Argument / result GPRs: x0-x7. Return value in x0 (w0 for 32-bit).
  • Caller-saved (volatile) GPRs: x0-x18. Free to clobber, not preserved across a call.
  • Callee-saved GPRs: x19-x28. Must be saved and restored if used.
  • x29 = frame pointer, x30 = link register. Saved on entry for any non-leaf function.
  • Stack pointer sp must stay 16-byte aligned at any public function boundary.
  • NEON/FP: v0-v7 args/results, v16-v31 caller-saved (fully volatile), and v8-v15 callee-saved but only their low 64 bits (d8-d15).
  • Leaf functions (no calls of their own) need no frame and may freely use volatile registers without a prologue.

Building

These are GAS-syntax .S files. Assemble with an AArch64 toolchain, for example:

# native on an aarch64 host
as -o mat4_mul.o mat4_mul.S

# cross from x86_64
aarch64-linux-gnu-gcc -c -march=armv8-a+simd mat4_mul.S -o mat4_mul.o

# tuned scheduling (if you let the assembler/peephole help)
aarch64-linux-gnu-gcc -c -mcpu=cortex-a53 simple_linear_regression.S -o slr.o

There is no driver or test harness in the repo. Each file exports a symbol meant to be linked into a larger program. The functions that call out (dfs.S calls visit) expect you to provide the callee.


Routine index

File Symbol Kind One-liner
simple_linear_regression.S simple_linear_regression_neon8 NEON Least-squares slope/intercept over float arrays, 8-wide unrolled with prefetch plus 4-wide plus scalar tails.
kmeans.S kmeans_assign NEON k-means assignment step, labels n 3-D points by nearest of 3 centroids (squared L2), branchless argmin via bit.
mat4_mul.S mat4_mul NEON 4x4 single-precision matrix multiply, column-major, broadcast-fmla.
dfs.S dfs_preorder_iter Stack/ABI Iterative pre-order tree traversal using a stack-allocated (VLA / alloca-style) explicit stack.
invert_binary_tree.S invert_binary_tree Stack/ABI Recursive binary-tree mirror, canonical callee-saved plus frame example.
invert_linked_list.S invert_linked_list Stack/ABI In-place singly-linked-list reversal, leaf function, zero stack usage.

NEON routines: layout rationale

simple_linear_regression.S (simple_linear_regression_neon8)

int simple_linear_regression_neon8(
        const float* x,  // x0
        const float* y,  // x1
        size_t n,        // x2
        float* out_slope,      // x3
        float* out_intercept); // x4
// returns 0 on success, -1 if n <= 1, -2 if the denominator is zero

Computes the four running sums a least-squares fit needs (sum_x, sum_y, sum_xy, sum_x2), then finishes with the closed-form slope and intercept. The structure is a three-stage descent: an 8-wide vector main loop, a single 4-wide vector cleanup, and a scalar tail.

Why it's laid out this way:

  • 8 elements per iteration with paired A/B accumulators. Each of the four sums has two independent accumulators (v0/v1 for sum_x, v2/v3 for sum_y, v4/v5 for sum_xy, v6/v7 for sum_x2). The "A" chain works on lanes i..i+3, the "B" chain on i+4..i+7. A single accumulator would force every fadd/fmla to wait on the previous one (NEON FP latency about 4 cycles, throughput about 1 per cycle), so the chain would run at roughly 1/4 of peak. Two independent chains let the pipeline keep both in flight, roughly doubling utilization. The two partials are summed once, after the loop (A + B), so the per-iteration cost stays minimal.

  • All four loads issued up front. The ld1 for xA, xB, yA, yB are grouped before any arithmetic. Because the A53 is in-order, putting the loads first gives the load/store unit a head start. By the time the fadd/fmla that consume them are reached, the data is more likely to be ready, instead of stalling the whole pipeline.

  • prfm pldl1keep about 256 bytes ahead on both streams. This is a pure streaming kernel, and with only a few outstanding misses possible (see the A53 notes above), the prefetch hides L2/memory latency that the hardware prefetcher would not fully cover.

  • fmla for the products. sum_xy and sum_x2 are fused multiply-accumulates: one instruction, one rounding, and it keeps the accumulator dependency local to a single op.

  • Tail handling without re-doing work. After the main loop, n % 8 can still hold one full group of 4, processed vectorized in vec4_setup, then the final 0..3 elements run scalar. The fall-through is deliberate. If the main loop was skipped (n < 8) the merged sums in v0/v2/v4/v6 are simply the zero-initialized partials, so the 4-wide stage is still correct.

  • Hazard-free horizontal reduction. The faddp reductions write into disjoint scratch (v20-v23), and only then are the values fmov'd into the scalar registers the final math uses. Sourcing the move from a register other than the one just written avoids an immediate read-after-write stall on the NEON-to-scalar transfer.

  • Leaf-clean scratch in the final math. The closed-form slope and intercept use s6/s7 and s16-s18 as scratch, all caller-saved. It deliberately avoids s8-s15 (the low halves of v8-v15, which are callee-saved under AAPCS64), so the routine stays a true leaf with no prologue or epilogue and never corrupts a caller's d8-d15.

kmeans.S (kmeans_assign)

void kmeans_assign(const float* xs,     // x0  point x coords (n floats)
                   const float* ys,     // x1  point y coords
                   const float* zs,     // x2  point z coords
                   const float* cents,  // x3  9 floats: c0x,c0y,c0z, c1.., c2..
                   uint32_t*    labels, // x4  n outputs, nearest centroid index 0..2
                   size_t       n);     // x5  point count, must be >= 4

The assignment step of k-means for 3 centroids over 3-D points. For every point, find the nearest of 3 centroids by squared Euclidean distance and write its index. This is the assignment half of Lloyd's algorithm; the centroid-update half is a separate concern and is not included here. The per-4-point distance and argmin kernel, the original "heart" of the file, is now a .macro (KM_KERNEL) driven by a streaming loop.

Why it's laid out this way:

  • SoA (structure-of-arrays) lane layout. X, Y, Z each live in their own array, loaded into v0/v1/v2 with one point per lane. This makes the distance computation pure vertical SIMD, with no shuffles and no cross-lane work, so 4 points are handled in lockstep per iteration.

  • Centroids broadcast once, before the loop. Nine ld1r splat each centroid coordinate across all four lanes of v31..v23. Hoisting this out of the loop means the hot path is nothing but fsub/fmul/fmla with no dup in the critical path. The argmin constants 1 and 2 (v21/v22) are likewise materialized once. Because the whole routine is a leaf that makes no calls, these live happily in caller-saved v16-v31 for its entire duration.

  • Loads issued up front. The three ld1s for x/y/z lead each iteration, giving the in-order load/store unit a head start before the kernel's fsubs are reached.

  • fmul then two fmla per distance. dx^2 seeds the accumulator with fmul, then + dy^2 and + dz^2 are fused. Distances for the three centroids land in separate registers (v16, v17, v18), so the three blocks are independent and can interleave in the pipeline rather than chaining through one accumulator.

  • Branchless argmin via bit. Instead of per-lane branches, which would defeat SIMD, the selection is a compare-and-blend. fcmgt builds a per-lane mask of "this distance beats the current best", and bit (bitwise insert-if-true) conditionally updates both the running index (v20) and the running best distance (v19). Critically, the index is updated first, using the still-valid mask, before best is overwritten. Otherwise the second update would key off the wrong comparison.

  • Squared distance, no sqrt. The argmin of distance and of distance-squared are identical, because sqrt is monotonic, so the routine skips four fsqrts per point.

  • Overlapping vector tail. The main loop runs n / 4 full groups. Any remainder (n & 3, that is 1 to 3 points) is handled by rewinding the pointers so one final 4-wide group covers the last 4 elements. The overlapped lanes are recomputed and rewritten with identical values, which is cheaper and simpler than a scalar tail and stays branch-light. This is why the contract requires n >= 4.

Verified against a scalar C reference for n = 4..23, covering every n & 3 remainder.

mat4_mul.S (mat4_mul)

void mat4_mul(const float* A,  // x0  column-major 4x4
              const float* B,  // x1  column-major 4x4
              float* C);       // x2  = A * B

A classic broadcast-fmla 4x4 multiply, column-major, where each q register is one column.

Why it's laid out this way:

  • Whole matrices loaded with ldp q,q. Two ldp per matrix pull all four columns of A (v31..v28) and B (v27..v24) in just four instructions, issued up front so the load unit runs ahead of the arithmetic.

  • A column of C is a linear combination of A's columns. Each output column C[:,j] is the sum over k of A[:,k] * B[k,j]. With B column-major, B[k,j] is lane k of B's column j, so the code uses the broadcast form fmla v, A_col, B_col.s[k]: multiply A's column by a single scalar lane of B, broadcast across all 4 lanes. No transposes, no horizontal adds.

  • Two output columns computed in parallel. Columns 0 and 1 accumulate into v0 and v1 with their fmlas interleaved, so the two independent dependency chains overlap. The first pair C[:,0..1] is stored before the second pair is computed, freeing registers and spacing out the work.

  • Leaf function, no prologue. It only touches v0-v3 and v24-v31, all caller-saved, and makes no calls, so no frame setup is needed.


Stack / recursion / ABI routines

dfs.S (dfs_preorder_iter, variable-length stack allocation, alloca-style)

// x0: root node ptr   x1: stack slot count (capacity)
// node layout: left ptr at offset 0, right ptr at offset 8
// calls out to visit(node) for each node, pre-order

Iterative pre-order DFS that keeps its own explicit work-stack in a stack-allocated variable-length buffer, the assembly equivalent of alloca or a C99 VLA. This is the example to study for "how do I carve scratch space off the stack at runtime?"

Key mechanics:

  • Frame setup with a saved FP. The prologue saves x29/x30, the callee-saved registers it uses (x19, x20, x21), and sets x29 = sp. Keeping the frame pointer is what makes the VLA cleanup trivial.
  • Runtime-sized allocation. slots * 8, rounded up to a 16-byte multiple (+15, & ~15), is subtracted from sp. This keeps sp 16-byte aligned regardless of the requested count, the rule you must never violate.
  • The explicit stack lives in that buffer. x20 is its base, x19 its moving top. str x0, [x19], #8 pushes (post-increment), and ldr x21, [x19, #-8]! pops (pre-decrement). Emptiness is detected by comparing the top against the base.
  • Pre-order with a LIFO twist. The right child is pushed before the left, so the left is popped first, which yields node-before-children order. The popped node is kept in x21, a callee-saved register, precisely because it must survive the bl visit call.
  • O(1) teardown. mov sp, x29 discards the entire VLA in a single instruction, with no need to track how much was allocated, then the saved registers are restored.

The caller must size the buffer for the maximum simultaneous stack occupancy, which for a push-both-children traversal is bounded by the number of pending nodes, not strictly the tree's depth.

invert_binary_tree.S (invert_binary_tree, textbook recursion)

node* invert_binary_tree(node* root); // x0 in / x0 out; left@0, right@8

Mirrors a binary tree by swapping every node's children. The cleanest example of a recursive non-leaf prologue and epilogue:

  • Saves x29/x30 and one callee-saved register x19 in a 32-byte aligned frame.
  • Caches the current node in x19 because x0 (volatile) is destroyed by the recursive bls. The swap itself is a neat ldp x1,x0,[x0] then stp x0,x1,[x19]: load both children and store them back in swapped order in two instructions.
  • After recursing left, it reloads the (now-swapped) right child from x19 rather than trusting any volatile register to have survived the first recursive call. That is the single most common recursion bug, made explicit here.

invert_linked_list.S (invert_linked_list, leaf, zero stack)

node* invert_linked_list(node* head); // x0 in / new head out; next ptr @0

In-place singly-linked-list reversal. Included as the contrast case: it is a leaf function that needs no stack frame at all. The classic three-pointer walk (prev/curr/next) lives entirely in volatile registers (x1, x2, x3), so there is nothing to save and the prologue and epilogue collapse to a single ret. When a routine makes no calls and fits in volatile registers, this is what you want. Don't pay for a frame you don't need.