⚙️ How Compilers Actually Optimize Your Code
From C source to x86 assembly — a tour through the compiler pipeline, the key optimization passes, and how to read the output of -O3 with GCC and Clang.
🏗️ The Compiler Pipeline
Your source code goes through four stages before becoming executable machine code:
Source (C/Rust/Go)
→ Frontend (lexing, parsing, semantic analysis)
→ IR (Intermediate Representation)
→ Optimizer (repeated passes over IR)
→ Backend (instruction selection, register allocation)
→ Codegen (machine code / assembly)
The Intermediate Representation (IR) is the crucial abstraction — a language- and architecture-independent form that the optimizer can reason about mathematically. Both GCC (GIMPLE → RTL) and LLVM/Clang (LLVM-IR) use multi-level IRs.
Godbolt Compiler Explorer lets you explore this pipeline interactively: write C or Rust on the left, and see the resulting assembly on the right, annotated with source lines.
📐 SSA Form: The Foundation
Most modern compilers convert IR into Static Single Assignment (SSA) form before optimization. In SSA, every variable is assigned exactly once, and new versions are created with numbered subscripts:
// Original:
x = 1
x = x + 5
y = x * 2
// SSA Form:
x₁ = 1
x₂ = x₁ + 5
y₁ = x₂ * 2
This seemingly simple transformation is the foundation for almost every optimization. With SSA, the def-use chain — the relationship between where a value is defined and where it is used — becomes explicit in the IR itself, eliminating expensive data-flow analysis.
At merge points (if-else, loops), SSA uses φ (phi) functions to select the correct version:
// SSA with a branch:
if (cond):
x₁ = 1
else:
x₂ = 2
x₃ = φ(x₁, x₂) // Picks x₁ or x₂ depending on which path was taken
return x₃
🔧 Key Optimization Passes
1. Constant Folding
The compiler evaluates constant expressions at compile time, not runtime:
// Source:
int x = 60 * 60 * 24; // seconds in a day
// After constant folding — compiler replaces with:
int x = 86400;
In LLVM-IR:
; Before:
%0 = mul i32 60, 60 ; 60 * 60 = 3600
%1 = mul i32 %0, 24 ; 3600 * 24 = 86400
; After constant folding:
%1 = i32 86400
2. Dead Code Elimination (DCE)
Removes code that computes values never used:
int compute(int x) {
int a = x * 2; // DEAD — a is never used
int b = x * 3;
return b;
}
// Optimized:
int compute(int x) {
return x * 3;
}
DCE is especially powerful combined with other passes — inlining may reveal that a function’s return value is unused, triggering removal of the entire call.
3. Common Subexpression Elimination (CSE)
If the same value is computed twice, the compiler hoists the computation:
int expensive(int x, int y) {
int a = (x * y) + 42; // First compute x*y
int b = (x * y) + 99; // Same x*y computed again
return a + b;
}
// After CSE:
int expensive(int x, int y) {
int t = x * y; // Compute once
int a = t + 42;
int b = t + 99;
return a + b;
}
4. Loop Invariant Code Motion (LICM)
Computations that produce the same result on every loop iteration are moved outside:
// Before:
for (int i = 0; i < n; i++) {
arr[i] = arr[i] * sqrt(2.0); // sqrt(2.0) computed n times!
}
// After LICM:
double factor = sqrt(2.0); // Computed once
for (int i = 0; i < n; i++) {
arr[i] = arr[i] * factor;
}
The compiler can only do this when it can prove sqrt(2.0) has no side effects — a category of analysis called aliasing and side-effect analysis.
5. Inlining
Replaces a function call with the function body:
// Before:
int square(int x) { return x * x; }
int result = square(42);
// After inlining:
int result = 42 * 42;
Inlining eliminates call overhead and, critically, enables downstream optimizations — the inlined code may now participate in constant folding, CSE, and DCE globally.
6. Vectorization (SIMD)
The compiler transforms scalar operations to use Single Instruction, Multiple Data (SIMD) instructions — operating on 4, 8, or 16 values simultaneously:
// Simple loop — compiler produces SIMD:
void add_arrays(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
On x86-64 with -O3 -march=native, Clang may generate:
; Using 256-bit YMM registers (AVX2):
vmovups ymm0, ymmword ptr [rdi + rcx] ; Load 8 floats from a
vaddps ymm0, ymm0, ymmword ptr [rsi + rcx] ; Add 8 floats from b
vmovups ymmword ptr [rdx + rcx], ymm0 ; Store 8 floats to c
add rcx, 32 ; Advance 32 bytes (8 floats)
cmp rcx, rax
jb .LBB0_1
The loop processes 8 floats per iteration instead of 1 — a theoretical 8× speedup.
7. Strength Reduction
Replaces expensive operations with cheaper equivalents:
// Multiplication by constant → shift + add:
int x = y * 2; → int x = y << 1; // 1 cycle vs 3 cycles
// Division by constant → multiply + shift:
int x = y / 8; → int x = y >> 3;
// Array indexing → pointer arithmetic:
arr[i] = 5; → *(arr + i * sizeof(int)) = 5;
In assembly, a multiplication by 2 becomes:
; Source: return x * 2;
; Naive (without strength reduction):
imul eax, edi, 2 ; 3 cycles, 4 bytes
; After strength reduction:
lea eax, [rdi + rdi] ; 1 cycle, 3 bytes — no multiply at all!
The LEA (Load Effective Address) instruction is the optimizer’s favorite trick — it can compute a + b*scale + offset without a separate arithmetic instruction.
📊 GCC vs Clang Optimization Levels
| Level | Name | What It Enables |
|---|---|---|
-O0 | None | Fastest compile, no optimization. Debug-friendly. |
-O1 | Optimize | Basic passes: CSE, DCE, constant folding. Good ratio of compile speed to runtime. |
-O2 | Optimize more | All -O1 + inlining, LICM, alias analysis. Default for most projects. |
-O3 | Optimize most | All -O2 + aggressive inlining, vectorization, loop unrolling. May increase code size. |
-Os | Optimize size | -O2 minus transformations that increase code size. Critical for embedded systems. |
-Oz | Optimize size aggressively | Even smaller than -Os. May use slower but smaller instruction sequences. |
Concrete Example: Counting Loop
// sum.c
int sum(int *arr, int n) {
int total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
return total;
}
-O0 (no optimization) — literal translation:
sum:
mov dword ptr [rsp-4], 0 ; total = 0
mov dword ptr [rsp-8], 0 ; i = 0
jmp .Lcond
.Lloop:
movsxd rax, dword ptr [rsp-8]
lea rcx, [rdi + rax*4]
add ecx, dword ptr [rcx]
mov dword ptr [rsp-4], ecx
add dword ptr [rsp-8], 1 ; i++
.Lcond:
cmp dword ptr [rsp-8], esi
jl .Lloop
mov eax, dword ptr [rsp-4]
ret
-O3 — Clang produces:
sum:
test esi, esi
jle .Lempty
lea rcx, [rdi + 4*rsi] ; End pointer
xor eax, eax ; total = 0
.Lloop:
add eax, dword ptr [rdi] ; total += arr[i]
add rdi, 4 ; ptr++ (not i++)
cmp rdi, rcx
jb .Lloop
ret
.Lempty:
xor eax, eax
ret
Key differences at -O3:
- No stack spills — everything in registers
- Pointer arithmetic replaces indexing —
rdiwalks through memory, no multiply needed i++becomes pointer comparison — no separate counter- Loop inverted — test at bottom, one less jump
With -O3 -march=native (AVX2 available):
sum:
test esi, esi
jle .Lempty
xor eax, eax
cmp esi, 7
jl .Lscalar ; < 8 elements → scalar
...
; Vectorized: process 8 ints at a time!
vpaddd ymm0, ymm1, ymmword ptr [rdi + 4*rax]
...
🔄 The Phase Ordering Problem
Optimization passes interact in complex ways. Running passes in different orders can produce different results:
// Example where pass order matters:
int f(int x) {
return (x * 2) + (x * 2);
}
// Order A: CSE first, then constant folding
// CSE: t = x * 2; return t + t;
// CF: return x * 4; ← Better
// Order B: Constant folding first, then CSE
// CF: no change (no constant operands)
// CSE: t = x * 2; return t + t; ← Missed opportunity
Modern compilers use pass pipelines tuned experimentally — LLVM uses a fixed pipeline with some heuristic reordering, while GCC’s -O2 and -O3 sequences have been refined over 30+ years.
📈 Profile-Guided Optimization (PGO)
PGO uses runtime profiling to inform optimization:
Step 1: Compile with instrumentation
$ clang -fprofile-instr-generate -O2 program.c -o program
Step 2: Run against representative workloads
$ ./program < large_input.txt
Step 3: Re-compile with profile data
$ llvm-profdata merge -output=code.profdata default.profraw
$ clang -fprofile-instr-use=code.profdata -O2 program.c -o program
The profile tells the compiler:
- Which branches are taken (enables better branch layout, block placement)
- Which functions are hot (more aggressive inlining)
- Which loops iterate many times (more unrolling, vectorization)
Google reported 15–40% speedups across their C++ codebase after deploying PGO with LLVM.
🔗 Link-Time Optimization (LTO)
Traditional compilation optimizes one file at a time — missing opportunities across translation unit boundaries:
// util.c
int add(int a, int b) { return a + b; }
// main.c
#include "util.h"
int main() { return add(2, 3); }
Without LTO, add can only be inlined if the compiler sees both files simultaneously. With LTO (-flto), the compiler stores IR in object files and runs global optimization at link time — enabling cross-file inlining, dead code elimination across the entire program, and whole-program alias analysis.
🎯 Takeaways
| Optimization | Typical Speedup | When It Fires |
|---|---|---|
| Constant folding | 1–5% | Any constant expression |
| DCE | 1–10% | Unused code, dead stores |
| CSE | 5–15% | Repeated subexpressions |
| LICM | 10–50%+ | Loop-heavy code |
| Inlining | 10–30% | Small, frequently-called functions |
| Vectorization | 2–8× | Data-parallel loops |
| PGO | 15–40% | Any hot-path code |
| LTO | 5–20% | Multi-file projects |
The golden rule: Write clear, idiomatic code and trust the optimizer. Don’t hand-optimize unless profiler evidence demands it — the compiler is better at micro-optimizations than humans, and readable code enables more optimizations because the compiler can reason about it more precisely.
📚 Further Reading
| Resource | Link |
|---|---|
| LLVM Pass Documentation | llvm.org/docs/Passes.html |
| Compiler Explorer (Godbolt) | godbolt.org |
| A Tour Through GCC’s Optimization Levels | godbolt.org/z/TTTT |
| Phase Ordering in LLVM | LLVM Dev Talk |
| ”Optimizing Compilers for Modern Architectures” (Allen & Kennedy) | Textbook, 2001 |