Optimization roadmap — pulling ahead of the competition
This is a research-grounded plan for making Symbolic the fastest and most correct from-scratch compiler in its class. Every claim here is tied to a measurement in this tree or to the cited literature, and every proposal is mapped onto Symbolic's actual architecture (shared target-independent IR → per-target hand-written encoders, no LLVM).
Where we are today (measured)
| Lever | Status | Evidence |
|---|---|---|
| Constant fold + propagate, DCE, copy-prop | shipped (-O1) |
sigil/runes/ir/src/main.sym:cfold/dce/melim |
Algebraic identities (x+0, x*1, x*0, …) |
shipped (-O2) |
:simp1/:simppass |
Strength reduction (x*2^k → x<<k) |
shipped (-O3) |
:sred1/:sredpass |
| Iterated pipeline (2 rounds O2, 3 rounds O3) | shipped | :runp |
| x64 redundant-move peephole | shipped (on at -O≥1) |
:mvrr/:mvrm/:mvmr + #rxk |
Immediate-form operands (add $imm, cmp $imm, shl $imm) |
shipped | :embini/:imok1/:imfit32 |
Cross-language micro-bench (64-bit LCG+xorshift, the headline number):
~3.7x gcc -O2 → ~1.6x after the peephole + -O2. Deterministic
instruction count on the same loop dropped 90.0M → 81.0M (-10%) with
immediate forms; the self-hosted compiler shrank 622 KB → 540 KB.
The bottleneck (the most important finding)
-O3 does not beat -O2 on the LCG, and immediate forms cut 10% of
instructions without moving the wall clock. Both facts have one root cause:
The register allocator (
sigil/runes/back/src/main.sym:ralloc) is naive. Despite the "linear-scan" comment it does no live-range analysis. It walks VRegs in numeric id order, hands out the 9 GPRs until they run out, then spills everything else to the stack for the whole function. Worse: if a function contains any call (has_calls), the pool is cut to the 5 callee-saved registers for the entire body.
Consequences, all visible in the LCG disassembly:
- The xorshift temporary
shis spilled to-0x8(%rbp)and reloaded one instruction later — a store→load-forward (~5 cycles) on the recurrence's critical path. That latency, not instruction count, is what we measure. - Because allocation is by id (not liveness), two temporaries with disjoint live ranges never share a register, so the 9-register pool is exhausted far too early.
- Any function with a
:bint/:wrchcall drops to 5 registers globally, even for code regions nowhere near the call.
Fixing this is the single highest-leverage change available and it gates the value of almost everything else (scheduling, unrolling, inlining all assume values live in registers — see Microsoft's optimization guide and the register-allocation/scheduling literature below).
Catalogue of optimizations across real compilers → our gap
Grouped by where they live in our pipeline. ✅ have · 🟡 partial · ❌ missing.
IR / mid-end (target-independent — every backend benefits)
- ✅ Constant folding & propagation, DCE, copy propagation
- ✅ Algebraic simplification, strength reduction (mul→shift)
- 🟡 Common-subexpression elimination (GVN/CSE) — missing the global form. Local value numbering within a basic block is cheap and high-value.
- ❌ Loop-invariant code motion (LICM) — hoist computations that don't change across iterations out of the loop.
- ❌ Loop unrolling — amortize the loop test/branch; exposes ILP. Classic
-O3transform (per GCC/Clang). - ❌ Induction-variable simplification — replace
i*stridewith an additive accumulator. - ❌ Function inlining — the highest-impact single optimization in the literature: typically 10–20% on C-like code (Microsoft, Walfridsson). Symbolic has cheap, frequent small functions (the rune model) so inlining small leaf calls should pay off disproportionately.
- ❌ Tail-call elimination — turn self-tail-recursion into a loop (helps the recursion examples; avoids stack growth).
- ❌ Branch/
Cmp+JumpCondfusion at the IR level (we still emitcmp; setcc; movzx; test; jccinstead ofcmp; jcc). - ❌ Jump threading / basic-block merging / unreachable-block elimination.
Back-end / codegen (per target, but the pattern is shared)
- ✅ Redundant-move peephole, immediate-form operands
- 🔴 Register allocation — the bottleneck above. Replace the id-order allocator with a real linear-scan allocator over computed live intervals (Poletto & Sarkar; used by V8, HotSpot client, ART — near graph-coloring quality at a fraction of the cost, ideal for a self-hosting compiler that must stay fast).
- ❌ Instruction scheduling — reorder to hide latency (e.g. issue the independent loop-counter update while the multiply is in flight). Most valuable after good regalloc; integrating the two is a known win.
- ❌ lea-based arithmetic — fold
x*const ± yinto onelea. - ❌ Condition-flag reuse — skip the
setcc/movzx/testround-trip when a compare directly feeds a branch. - ❌ Two-operand peepholes:
add $1 → inc, zero viaxor r,r,mov $0 → xor.
Whole-program / profile
- ❌ Profile-guided optimization (PGO) — hot/cold splitting, block layout.
- ❌ Link-time / whole-program inlining (we already compile whole-program, so this is "free" structurally).
- 🟡 SIMD/vectorization — only via the explicit SSE/x87 float path today; no auto-vectorizer.
Proposals, prioritized by impact ÷ effort
P0 — Real linear-scan register allocation ✅ DONE (the big one)
Shipped. sigil/runes/back/src/main.sym:ralloc now computes live intervals
(:ivins/:ivd), extends them across loop back-edges (:lxtend — the
correctness fix for loop-carried/loop-invariant values), and linear-scans with
register reuse + furthest-end spill. Pool keeps callee-saved popped first so the
prologue's param binding can never clobber an incoming argument register. Result:
the LCG hot loop is now spill-free and the bench went 1.9x → 1.18x gcc -O2
(2.23 ns/iter); the self-hosted compiler shrank 540 KB → 512 KB. Original plan:
P0 (original) — Real linear-scan register allocation (unlocks everything)
Compute live intervals (first-def … last-use index per VReg over the linear instruction list), then linear-scan assign physical registers, spilling the interval with the furthest end on pressure (Poletto/Sarkar). Refinements:
- Track call sites precisely: only values live across a call are forced to callee-saved or spilled — not the whole function. This alone removes the global 5-register cap.
- Prefer keeping short critical-path temporaries (like
sh) in registers.
Expected: eliminates the LCG stack spill → critical path ~14→~7 cycles →
roughly 2x on the LCG (~1.6x C → ~0.8–0.9x C, i.e. at/under gcc -O2), and
broad gains everywhere. This is target-independent bookkeeping feeding each
backend's existing #rk/#rv, so all backends benefit at once.
P1 — Branch fusion + flag reuse
When a Cmp (tag 7) is immediately consumed by a JumpCond (tag 8) on the same
value, emit cmp; jcc directly and drop the setcc/movzx/test. Saves 3–4
instructions per branch — every loop and if. Implement as an IR peephole
(fuse adjacent Cmp+JumpCond into a fused node) so all backends share it.
P2 — Function inlining (mid-end)
Inline small/leaf user functions (size threshold; the call graph is already
known — #funcs + #calls). Renumber the callee's VRegs into the caller and
splice its body, then let cfold/dce/copy-prop/regalloc clean up. 10–20%
expected on call-heavy code; with the rune model, likely more. Pairs with P0
(inlined bodies need registers to pay off). Add at -O2.
P3 — LICM + loop unrolling (loop opts at -O3)
Detect natural loops (back-edge to a dominating label), hoist invariant
sub-expressions, and unroll small bodies by 2–4×. This is the canonical reason
-O3 > -O2 in production compilers; it would finally make our -O3 pull ahead
of -O2 instead of tying it.
P4 — Local value numbering (CSE)
Within a basic block, hash (op, operands) and reuse prior results. Cheap,
complements copy-prop, and removes repeated address/index math.
P5 — Instruction scheduling
After P0, list-schedule each block to hide multiply/load latency. Integrate lightly with regalloc to avoid re-introducing spills (the integrated approach is the documented best practice).
Making -O3 earn its name
Today -O3 = -O2 + strength reduction + an extra pass round. The plan above
gives -O3 real teeth by gating the loop optimizations there:
-O0 none
-O1 fold/prop, DCE, copy-prop (+ peephole, imm-forms)
-O2 -O1 + algebraic simplification + CSE + inlining + branch fusion, iterated
-O3 -O2 + LICM + loop unrolling + induction-var simplification + scheduling
With linear-scan regalloc (P0) under all of them, -O2 should reach parity with
gcc -O2 on integer code and -O3 should pass it on loop-heavy code.
Correctness — staying ahead while getting faster
Speed must never cost correctness; the moat is that every change is proven end-to-end:
- The self-host fixpoint (
symc --no-std < sigil/runes/symc/src/main.sym== shipped dist, byte-for-byte) is the strongest possible regression test — a miscompiled optimization cannot reproduce the 540 KB compiler. - The example suite runs byte-exact against one shared oracle on every backend (native x64, qemu arm64/riscv64/loongarch64/android, wasmtime, lavapipe SPIR-V, QEMU+OVMF UEFI, a real FreeBSD VM) and through the LLVM fallback — so an optimization that breaks any ISA is caught immediately.
- Run the full suite at
-O0/-O1/-O2/-O3(already wired) on every change; differential testing across opt levels catches pass bugs. - Additions to harden the moat: a randomized differential fuzzer (generate
programs, require identical output at all
-Olevels and across backends) and per-pass callgrind instruction-count gates in CI so a "speedup" that regresses instruction count fails the build.
Sources
- Poletto & Sarkar, Linear Scan Register Allocation, ACM TOPLAS — the algorithm behind V8 / HotSpot-client / ART regalloc. https://dl.acm.org/doi/10.1145/330249.330250
- Quality and speed in linear-scan register allocation, PLDI'98. https://dl.acm.org/doi/10.1145/277650.277714
- Sarkar & Barik, Extended Linear Scan, CC'07 (15–68× faster than graph coloring at comparable quality). https://www.cs.rice.edu/~vs3/PDF/cc2007.pdf
- Microsoft, What Every Programmer Should Know About Compiler Optimizations (inlining as the top single optimization; regalloc enabling the rest). https://learn.microsoft.com/en-us/archive/msdn-magazine/2015/february/compilers-what-every-programmer-should-know-about-compiler-optimizations
- Integrated Instruction Scheduling and Register Allocation Techniques. https://www.cs.virginia.edu/~soffa/Soffa_Pubs_all/Conferences/Integrated.Berson.1998.pdf
- Walfridsson, Inlining and microbenchmarking (why inlining wins, measurement caveats). http://kristerw.blogspot.com/2018/08/inlining-and-microbenchmarking.html