- Assembly 100%
| dfs.S | ||
| invert_binary_tree.S | ||
| invert_linked_list.S | ||
| kmeans.S | ||
| mat4_mul.S | ||
| README.md | ||
| simple_linear_regression.S | ||
a53 asm sketchbook
A collection of hand-written AArch64 (ARMv8-A) assembly routines targeting the ARM Cortex-A53. Some are old, some are new. I decided to publish them as they are, so think of this less as a library and more as a sketchbook of my experiments with the core.
The point is mostly pedagogical. Each file is a worked example of why the instructions are arranged the way they are on this particular core, not just what they compute.
Two themes run through the code:
- NEON / data-parallel kernels (
simple_linear_regression.S,kmeans.S,mat4_mul.S), where loop unrolling and load ordering are tuned for the A53 pipeline. - Stack / recursion / ABI examples (
dfs.S,invert_binary_tree.S,invert_linked_list.S), which show how to set up frames, use a variable-length stack allocation (alloca-style), and honour the procedure-call standard.
Why the Cortex-A53 changes how you write the code
The A53 is an in-order, dual-issue superscalar core. Three properties dominate every decision in this repo.
-
In-order issue. Instructions are issued in program order. The hardware cannot look ahead and reorder around a stalled instruction the way an out-of-order core (Cortex-A72, A78, Apple cores, and so on) does. If instruction N stalls waiting on a result or a load, the pipeline stalls, and instruction N+1 does not sneak past it. So the programmer (or compiler) is responsible for scheduling: you have to manually interleave independent work so there is something useful to issue while a long-latency operation is in flight.
-
Dual-issue, but with restrictions. The A53 can issue up to two instructions per cycle, but only for certain pairings. To benefit you need a steady supply of mutually independent instructions adjacent in the stream. Long dependency chains serialize and waste half the machine.
-
A shallow memory subsystem that stalls easily. The L1 data cache can only keep a very small number of outstanding misses in flight at once (its linefill buffers, the MSHR-style slots, number around 3). Once those are occupied, the next miss cannot even be issued, and because the core is in-order the whole pipeline stalls behind it. The practical effect is that the A53 is not just slow on a single miss, it is slow on a burst of misses, because it cannot overlap many of them. This is exactly why software prefetch (below) earns its keep: it spends those scarce slots ahead of time, on addresses you know you will need, instead of discovering the misses too late.
The practical consequences, all visible in the kernels here:
| Technique | Why it matters on A53 |
|---|---|
| Loop unrolling | Amortizes loop overhead and, more importantly, exposes more independent instructions per iteration so the dual-issue slots stay full. |
| Multiple accumulator chains | A single fmla accumulator creates a serial dependency. Each fmla waits on the previous one's result, and NEON FP latency on the A53 is about 4 cycles, so a single chain runs at roughly 1/4 of peak. Splitting into two or more independent accumulators lets the pipeline overlap them and hide that latency. |
| Hoisting / batching loads | NEON loads have multi-cycle latency and the core can't reorder around them. Issuing all the loads for an iteration up front gives the load/store unit time to retire them before the arithmetic that consumes them is reached. |
Software prefetch (prfm) |
The A53's hardware prefetcher is modest and, as noted above, only a handful of misses can be outstanding at once. Explicit prfm pldl1keep a few cache lines ahead spends those slots early and hides L2/DRAM latency for streaming workloads. |
| Avoiding read-after-write hazards on NEON <-> scalar moves | Reductions are written into disjoint scratch registers so the consuming move doesn't immediately depend on the just-written value. |
On an out-of-order core most of this is invisible, since the hardware recovers the parallelism for you. On the A53 it is the difference between near-peak throughput and a pipeline that idles on every dependency.
ABI / calling convention (AAPCS64) cheat-sheet
The routines follow the standard AArch64 procedure call standard. Quick reference for reading the prologues and epilogues:
- Argument / result GPRs:
x0-x7. Return value inx0(w0for 32-bit). - Caller-saved (volatile) GPRs:
x0-x18. Free to clobber, not preserved across a call. - Callee-saved GPRs:
x19-x28. Must be saved and restored if used. x29= frame pointer,x30= link register. Saved on entry for any non-leaf function.- Stack pointer
spmust stay 16-byte aligned at any public function boundary. - NEON/FP:
v0-v7args/results,v16-v31caller-saved (fully volatile), andv8-v15callee-saved but only their low 64 bits (d8-d15). - Leaf functions (no calls of their own) need no frame and may freely use volatile registers without a prologue.
Building
These are GAS-syntax .S files. Assemble with an AArch64 toolchain, for example:
# native on an aarch64 host
as -o mat4_mul.o mat4_mul.S
# cross from x86_64
aarch64-linux-gnu-gcc -c -march=armv8-a+simd mat4_mul.S -o mat4_mul.o
# tuned scheduling (if you let the assembler/peephole help)
aarch64-linux-gnu-gcc -c -mcpu=cortex-a53 simple_linear_regression.S -o slr.o
There is no driver or test harness in the repo. Each file exports a symbol meant to be linked
into a larger program. The functions that call out (dfs.S calls visit) expect you to
provide the callee.
Routine index
| File | Symbol | Kind | One-liner |
|---|---|---|---|
simple_linear_regression.S |
simple_linear_regression_neon8 |
NEON | Least-squares slope/intercept over float arrays, 8-wide unrolled with prefetch plus 4-wide plus scalar tails. |
kmeans.S |
kmeans_assign |
NEON | k-means assignment step, labels n 3-D points by nearest of 3 centroids (squared L2), branchless argmin via bit. |
mat4_mul.S |
mat4_mul |
NEON | 4x4 single-precision matrix multiply, column-major, broadcast-fmla. |
dfs.S |
dfs_preorder_iter |
Stack/ABI | Iterative pre-order tree traversal using a stack-allocated (VLA / alloca-style) explicit stack. |
invert_binary_tree.S |
invert_binary_tree |
Stack/ABI | Recursive binary-tree mirror, canonical callee-saved plus frame example. |
invert_linked_list.S |
invert_linked_list |
Stack/ABI | In-place singly-linked-list reversal, leaf function, zero stack usage. |
NEON routines: layout rationale
simple_linear_regression.S (simple_linear_regression_neon8)
int simple_linear_regression_neon8(
const float* x, // x0
const float* y, // x1
size_t n, // x2
float* out_slope, // x3
float* out_intercept); // x4
// returns 0 on success, -1 if n <= 1, -2 if the denominator is zero
Computes the four running sums a least-squares fit needs (sum_x, sum_y, sum_xy,
sum_x2), then finishes with the closed-form slope and intercept. The structure is a
three-stage descent: an 8-wide vector main loop, a single 4-wide vector cleanup, and a
scalar tail.
Why it's laid out this way:
-
8 elements per iteration with paired A/B accumulators. Each of the four sums has two independent accumulators (
v0/v1forsum_x,v2/v3forsum_y,v4/v5forsum_xy,v6/v7forsum_x2). The "A" chain works on lanesi..i+3, the "B" chain oni+4..i+7. A single accumulator would force everyfadd/fmlato wait on the previous one (NEON FP latency about 4 cycles, throughput about 1 per cycle), so the chain would run at roughly 1/4 of peak. Two independent chains let the pipeline keep both in flight, roughly doubling utilization. The two partials are summed once, after the loop (A + B), so the per-iteration cost stays minimal. -
All four loads issued up front. The
ld1forxA,xB,yA,yBare grouped before any arithmetic. Because the A53 is in-order, putting the loads first gives the load/store unit a head start. By the time thefadd/fmlathat consume them are reached, the data is more likely to be ready, instead of stalling the whole pipeline. -
prfm pldl1keepabout 256 bytes ahead on both streams. This is a pure streaming kernel, and with only a few outstanding misses possible (see the A53 notes above), the prefetch hides L2/memory latency that the hardware prefetcher would not fully cover. -
fmlafor the products.sum_xyandsum_x2are fused multiply-accumulates: one instruction, one rounding, and it keeps the accumulator dependency local to a single op. -
Tail handling without re-doing work. After the main loop,
n % 8can still hold one full group of 4, processed vectorized invec4_setup, then the final0..3elements run scalar. The fall-through is deliberate. If the main loop was skipped (n < 8) the merged sums inv0/v2/v4/v6are simply the zero-initialized partials, so the 4-wide stage is still correct. -
Hazard-free horizontal reduction. The
faddpreductions write into disjoint scratch (v20-v23), and only then are the valuesfmov'd into the scalar registers the final math uses. Sourcing the move from a register other than the one just written avoids an immediate read-after-write stall on the NEON-to-scalar transfer. -
Leaf-clean scratch in the final math. The closed-form slope and intercept use
s6/s7ands16-s18as scratch, all caller-saved. It deliberately avoidss8-s15(the low halves ofv8-v15, which are callee-saved under AAPCS64), so the routine stays a true leaf with no prologue or epilogue and never corrupts a caller'sd8-d15.
kmeans.S (kmeans_assign)
void kmeans_assign(const float* xs, // x0 point x coords (n floats)
const float* ys, // x1 point y coords
const float* zs, // x2 point z coords
const float* cents, // x3 9 floats: c0x,c0y,c0z, c1.., c2..
uint32_t* labels, // x4 n outputs, nearest centroid index 0..2
size_t n); // x5 point count, must be >= 4
The assignment step of k-means for 3 centroids over 3-D points. For every point, find the
nearest of 3 centroids by squared Euclidean distance and write its index. This is the
assignment half of Lloyd's algorithm; the centroid-update half is a separate concern and is not
included here. The per-4-point distance and argmin kernel, the original "heart" of the file, is
now a .macro (KM_KERNEL) driven by a streaming loop.
Why it's laid out this way:
-
SoA (structure-of-arrays) lane layout. X, Y, Z each live in their own array, loaded into
v0/v1/v2with one point per lane. This makes the distance computation pure vertical SIMD, with no shuffles and no cross-lane work, so 4 points are handled in lockstep per iteration. -
Centroids broadcast once, before the loop. Nine
ld1rsplat each centroid coordinate across all four lanes ofv31..v23. Hoisting this out of the loop means the hot path is nothing butfsub/fmul/fmlawith nodupin the critical path. The argmin constants1and2(v21/v22) are likewise materialized once. Because the whole routine is a leaf that makes no calls, these live happily in caller-savedv16-v31for its entire duration. -
Loads issued up front. The three
ld1s for x/y/z lead each iteration, giving the in-order load/store unit a head start before the kernel'sfsubs are reached. -
fmulthen twofmlaper distance.dx^2seeds the accumulator withfmul, then+ dy^2and+ dz^2are fused. Distances for the three centroids land in separate registers (v16,v17,v18), so the three blocks are independent and can interleave in the pipeline rather than chaining through one accumulator. -
Branchless argmin via
bit. Instead of per-lane branches, which would defeat SIMD, the selection is a compare-and-blend.fcmgtbuilds a per-lane mask of "this distance beats the current best", andbit(bitwise insert-if-true) conditionally updates both the running index (v20) and the running best distance (v19). Critically, the index is updated first, using the still-valid mask, beforebestis overwritten. Otherwise the second update would key off the wrong comparison. -
Squared distance, no
sqrt. The argmin of distance and of distance-squared are identical, becausesqrtis monotonic, so the routine skips fourfsqrts per point. -
Overlapping vector tail. The main loop runs
n / 4full groups. Any remainder (n & 3, that is 1 to 3 points) is handled by rewinding the pointers so one final 4-wide group covers the last 4 elements. The overlapped lanes are recomputed and rewritten with identical values, which is cheaper and simpler than a scalar tail and stays branch-light. This is why the contract requiresn >= 4.
Verified against a scalar C reference for n = 4..23, covering every n & 3 remainder.
mat4_mul.S (mat4_mul)
void mat4_mul(const float* A, // x0 column-major 4x4
const float* B, // x1 column-major 4x4
float* C); // x2 = A * B
A classic broadcast-fmla 4x4 multiply, column-major, where each q register is one
column.
Why it's laid out this way:
-
Whole matrices loaded with
ldp q,q. Twoldpper matrix pull all four columns of A (v31..v28) and B (v27..v24) in just four instructions, issued up front so the load unit runs ahead of the arithmetic. -
A column of C is a linear combination of A's columns. Each output column
C[:,j]is the sum overkofA[:,k] * B[k,j]. With B column-major,B[k,j]is lanekof B's columnj, so the code uses the broadcast formfmla v, A_col, B_col.s[k]: multiply A's column by a single scalar lane of B, broadcast across all 4 lanes. No transposes, no horizontal adds. -
Two output columns computed in parallel. Columns 0 and 1 accumulate into
v0andv1with theirfmlas interleaved, so the two independent dependency chains overlap. The first pairC[:,0..1]is stored before the second pair is computed, freeing registers and spacing out the work. -
Leaf function, no prologue. It only touches
v0-v3andv24-v31, all caller-saved, and makes no calls, so no frame setup is needed.
Stack / recursion / ABI routines
dfs.S (dfs_preorder_iter, variable-length stack allocation, alloca-style)
// x0: root node ptr x1: stack slot count (capacity)
// node layout: left ptr at offset 0, right ptr at offset 8
// calls out to visit(node) for each node, pre-order
Iterative pre-order DFS that keeps its own explicit work-stack in a stack-allocated
variable-length buffer, the assembly equivalent of alloca or a C99 VLA. This is the example
to study for "how do I carve scratch space off the stack at runtime?"
Key mechanics:
- Frame setup with a saved FP. The prologue saves
x29/x30, the callee-saved registers it uses (x19,x20,x21), and setsx29 = sp. Keeping the frame pointer is what makes the VLA cleanup trivial. - Runtime-sized allocation.
slots * 8, rounded up to a 16-byte multiple (+15, & ~15), is subtracted fromsp. This keepssp16-byte aligned regardless of the requested count, the rule you must never violate. - The explicit stack lives in that buffer.
x20is its base,x19its moving top.str x0, [x19], #8pushes (post-increment), andldr x21, [x19, #-8]!pops (pre-decrement). Emptiness is detected by comparing the top against the base. - Pre-order with a LIFO twist. The right child is pushed before the left, so the left is
popped first, which yields node-before-children order. The popped node is kept in
x21, a callee-saved register, precisely because it must survive thebl visitcall. - O(1) teardown.
mov sp, x29discards the entire VLA in a single instruction, with no need to track how much was allocated, then the saved registers are restored.
The caller must size the buffer for the maximum simultaneous stack occupancy, which for a push-both-children traversal is bounded by the number of pending nodes, not strictly the tree's depth.
invert_binary_tree.S (invert_binary_tree, textbook recursion)
node* invert_binary_tree(node* root); // x0 in / x0 out; left@0, right@8
Mirrors a binary tree by swapping every node's children. The cleanest example of a recursive non-leaf prologue and epilogue:
- Saves
x29/x30and one callee-saved registerx19in a 32-byte aligned frame. - Caches the current node in
x19becausex0(volatile) is destroyed by the recursivebls. The swap itself is a neatldp x1,x0,[x0]thenstp x0,x1,[x19]: load both children and store them back in swapped order in two instructions. - After recursing left, it reloads the (now-swapped) right child from
x19rather than trusting any volatile register to have survived the first recursive call. That is the single most common recursion bug, made explicit here.
invert_linked_list.S (invert_linked_list, leaf, zero stack)
node* invert_linked_list(node* head); // x0 in / new head out; next ptr @0
In-place singly-linked-list reversal. Included as the contrast case: it is a leaf
function that needs no stack frame at all. The classic three-pointer walk
(prev/curr/next) lives entirely in volatile registers (x1, x2, x3), so there is
nothing to save and the prologue and epilogue collapse to a single ret. When a routine makes
no calls and fits in volatile registers, this is what you want. Don't pay for a frame you don't
need.