Runes

Optimization roadmap — pulling ahead of the competition

This is a research-grounded plan for making Symbolic the fastest and most correct from-scratch compiler in its class. Every claim here is tied to a measurement in this tree or to the cited literature, and every proposal is mapped onto Symbolic's actual architecture (shared target-independent IR → per-target hand-written encoders, no LLVM).

Where we are today (measured)

Lever Status Evidence
Constant fold + propagate, DCE, copy-prop shipped (-O1) sigil/runes/ir/src/main.sym:cfold/dce/melim
Algebraic identities (x+0, x*1, x*0, …) shipped (-O2) :simp1/:simppass
Strength reduction (x*2^k → x<<k) shipped (-O3) :sred1/:sredpass
Iterated pipeline (2 rounds O2, 3 rounds O3) shipped :runp
x64 redundant-move peephole shipped (on at -O≥1) :mvrr/:mvrm/:mvmr + #rxk
Immediate-form operands (add $imm, cmp $imm, shl $imm) shipped :embini/:imok1/:imfit32

Cross-language micro-bench (64-bit LCG+xorshift, the headline number): ~3.7x gcc -O2 → ~1.6x after the peephole + -O2. Deterministic instruction count on the same loop dropped 90.0M → 81.0M (-10%) with immediate forms; the self-hosted compiler shrank 622 KB → 540 KB.

The bottleneck (the most important finding)

-O3 does not beat -O2 on the LCG, and immediate forms cut 10% of instructions without moving the wall clock. Both facts have one root cause:

The register allocator (sigil/runes/back/src/main.sym:ralloc) is naive. Despite the "linear-scan" comment it does no live-range analysis. It walks VRegs in numeric id order, hands out the 9 GPRs until they run out, then spills everything else to the stack for the whole function. Worse: if a function contains any call (has_calls), the pool is cut to the 5 callee-saved registers for the entire body.

Consequences, all visible in the LCG disassembly:

  • The xorshift temporary sh is spilled to -0x8(%rbp) and reloaded one instruction later — a store→load-forward (~5 cycles) on the recurrence's critical path. That latency, not instruction count, is what we measure.
  • Because allocation is by id (not liveness), two temporaries with disjoint live ranges never share a register, so the 9-register pool is exhausted far too early.
  • Any function with a :bint/:wrch call drops to 5 registers globally, even for code regions nowhere near the call.

Fixing this is the single highest-leverage change available and it gates the value of almost everything else (scheduling, unrolling, inlining all assume values live in registers — see Microsoft's optimization guide and the register-allocation/scheduling literature below).

Catalogue of optimizations across real compilers → our gap

Grouped by where they live in our pipeline. ✅ have · 🟡 partial · ❌ missing.

IR / mid-end (target-independent — every backend benefits)

  • ✅ Constant folding & propagation, DCE, copy propagation
  • ✅ Algebraic simplification, strength reduction (mul→shift)
  • 🟡 Common-subexpression elimination (GVN/CSE)missing the global form. Local value numbering within a basic block is cheap and high-value.
  • Loop-invariant code motion (LICM) — hoist computations that don't change across iterations out of the loop.
  • Loop unrolling — amortize the loop test/branch; exposes ILP. Classic -O3 transform (per GCC/Clang).
  • Induction-variable simplification — replace i*stride with an additive accumulator.
  • Function inliningthe highest-impact single optimization in the literature: typically 10–20% on C-like code (Microsoft, Walfridsson). Symbolic has cheap, frequent small functions (the rune model) so inlining small leaf calls should pay off disproportionately.
  • ❌ Tail-call elimination — turn self-tail-recursion into a loop (helps the recursion examples; avoids stack growth).
  • ❌ Branch/Cmp+JumpCond fusion at the IR level (we still emit cmp; setcc; movzx; test; jcc instead of cmp; jcc).
  • ❌ Jump threading / basic-block merging / unreachable-block elimination.

Back-end / codegen (per target, but the pattern is shared)

  • ✅ Redundant-move peephole, immediate-form operands
  • 🔴 Register allocationthe bottleneck above. Replace the id-order allocator with a real linear-scan allocator over computed live intervals (Poletto & Sarkar; used by V8, HotSpot client, ART — near graph-coloring quality at a fraction of the cost, ideal for a self-hosting compiler that must stay fast).
  • Instruction scheduling — reorder to hide latency (e.g. issue the independent loop-counter update while the multiply is in flight). Most valuable after good regalloc; integrating the two is a known win.
  • lea-based arithmetic — fold x*const ± y into one lea.
  • ❌ Condition-flag reuse — skip the setcc/movzx/test round-trip when a compare directly feeds a branch.
  • ❌ Two-operand peepholes: add $1 → inc, zero via xor r,r, mov $0 → xor.

Whole-program / profile

  • ❌ Profile-guided optimization (PGO) — hot/cold splitting, block layout.
  • ❌ Link-time / whole-program inlining (we already compile whole-program, so this is "free" structurally).
  • 🟡 SIMD/vectorization — only via the explicit SSE/x87 float path today; no auto-vectorizer.

Proposals, prioritized by impact ÷ effort

P0 — Real linear-scan register allocation ✅ DONE (the big one)

Shipped. sigil/runes/back/src/main.sym:ralloc now computes live intervals (:ivins/:ivd), extends them across loop back-edges (:lxtend — the correctness fix for loop-carried/loop-invariant values), and linear-scans with register reuse + furthest-end spill. Pool keeps callee-saved popped first so the prologue's param binding can never clobber an incoming argument register. Result: the LCG hot loop is now spill-free and the bench went 1.9x → 1.18x gcc -O2 (2.23 ns/iter); the self-hosted compiler shrank 540 KB → 512 KB. Original plan:

P0 (original) — Real linear-scan register allocation (unlocks everything)

Compute live intervals (first-def … last-use index per VReg over the linear instruction list), then linear-scan assign physical registers, spilling the interval with the furthest end on pressure (Poletto/Sarkar). Refinements:

  • Track call sites precisely: only values live across a call are forced to callee-saved or spilled — not the whole function. This alone removes the global 5-register cap.
  • Prefer keeping short critical-path temporaries (like sh) in registers.

Expected: eliminates the LCG stack spill → critical path ~14→~7 cycles → roughly 2x on the LCG (~1.6x C → ~0.8–0.9x C, i.e. at/under gcc -O2), and broad gains everywhere. This is target-independent bookkeeping feeding each backend's existing #rk/#rv, so all backends benefit at once.

P1 — Branch fusion + flag reuse

When a Cmp (tag 7) is immediately consumed by a JumpCond (tag 8) on the same value, emit cmp; jcc directly and drop the setcc/movzx/test. Saves 3–4 instructions per branch — every loop and if. Implement as an IR peephole (fuse adjacent Cmp+JumpCond into a fused node) so all backends share it.

P2 — Function inlining (mid-end)

Inline small/leaf user functions (size threshold; the call graph is already known — #funcs + #calls). Renumber the callee's VRegs into the caller and splice its body, then let cfold/dce/copy-prop/regalloc clean up. 10–20% expected on call-heavy code; with the rune model, likely more. Pairs with P0 (inlined bodies need registers to pay off). Add at -O2.

P3 — LICM + loop unrolling (loop opts at -O3)

Detect natural loops (back-edge to a dominating label), hoist invariant sub-expressions, and unroll small bodies by 2–4×. This is the canonical reason -O3 > -O2 in production compilers; it would finally make our -O3 pull ahead of -O2 instead of tying it.

P4 — Local value numbering (CSE)

Within a basic block, hash (op, operands) and reuse prior results. Cheap, complements copy-prop, and removes repeated address/index math.

P5 — Instruction scheduling

After P0, list-schedule each block to hide multiply/load latency. Integrate lightly with regalloc to avoid re-introducing spills (the integrated approach is the documented best practice).

Making -O3 earn its name

Today -O3 = -O2 + strength reduction + an extra pass round. The plan above gives -O3 real teeth by gating the loop optimizations there:

-O0  none
-O1  fold/prop, DCE, copy-prop                              (+ peephole, imm-forms)
-O2  -O1 + algebraic simplification + CSE + inlining + branch fusion, iterated
-O3  -O2 + LICM + loop unrolling + induction-var simplification + scheduling

With linear-scan regalloc (P0) under all of them, -O2 should reach parity with gcc -O2 on integer code and -O3 should pass it on loop-heavy code.

Correctness — staying ahead while getting faster

Speed must never cost correctness; the moat is that every change is proven end-to-end:

  • The self-host fixpoint (symc --no-std < sigil/runes/symc/src/main.sym == shipped dist, byte-for-byte) is the strongest possible regression test — a miscompiled optimization cannot reproduce the 540 KB compiler.
  • The example suite runs byte-exact against one shared oracle on every backend (native x64, qemu arm64/riscv64/loongarch64/android, wasmtime, lavapipe SPIR-V, QEMU+OVMF UEFI, a real FreeBSD VM) and through the LLVM fallback — so an optimization that breaks any ISA is caught immediately.
  • Run the full suite at -O0/-O1/-O2/-O3 (already wired) on every change; differential testing across opt levels catches pass bugs.
  • Additions to harden the moat: a randomized differential fuzzer (generate programs, require identical output at all -O levels and across backends) and per-pass callgrind instruction-count gates in CI so a "speedup" that regresses instruction count fails the build.

Sources