⚙️ How Compilers Actually Optimize Your Code

Pranesh Nikhar Feb 26, 2026 02/26/26 8 min read

 󰈤

From C source to x86 assembly — a tour through the compiler pipeline, the key optimization passes, and how to read the output of -O3 with GCC and Clang.

🏗️ The Compiler Pipeline

Your source code goes through four stages before becoming executable machine code:

Source (C/Rust/Go)
  → Frontend (lexing, parsing, semantic analysis)
    → IR (Intermediate Representation)
      → Optimizer (repeated passes over IR)
        → Backend (instruction selection, register allocation)
          → Codegen (machine code / assembly)

The Intermediate Representation (IR) is the crucial abstraction — a language- and architecture-independent form that the optimizer can reason about mathematically. Both GCC (GIMPLE → RTL) and LLVM/Clang (LLVM-IR) use multi-level IRs.

Godbolt Compiler Explorer lets you explore this pipeline interactively: write C or Rust on the left, and see the resulting assembly on the right, annotated with source lines.

📐 SSA Form: The Foundation

Most modern compilers convert IR into Static Single Assignment (SSA) form before optimization. In SSA, every variable is assigned exactly once, and new versions are created with numbered subscripts:

// Original:
x = 1
x = x + 5
y = x * 2

// SSA Form:
x₁ = 1
x₂ = x₁ + 5
y₁ = x₂ * 2

This seemingly simple transformation is the foundation for almost every optimization. With SSA, the def-use chain — the relationship between where a value is defined and where it is used — becomes explicit in the IR itself, eliminating expensive data-flow analysis.

At merge points (if-else, loops), SSA uses φ (phi) functions to select the correct version:

// SSA with a branch:
if (cond):
  x₁ = 1
else:
  x₂ = 2
x₃ = φ(x₁, x₂)  // Picks x₁ or x₂ depending on which path was taken
return x₃

🔧 Key Optimization Passes

1. Constant Folding

The compiler evaluates constant expressions at compile time, not runtime:

// Source:
int x = 60 * 60 * 24;  // seconds in a day

// After constant folding — compiler replaces with:
int x = 86400;

In LLVM-IR:

; Before:
%0 = mul i32 60, 60    ; 60 * 60 = 3600
%1 = mul i32 %0, 24    ; 3600 * 24 = 86400

; After constant folding:
%1 = i32 86400

2. Dead Code Elimination (DCE)

Removes code that computes values never used:

int compute(int x) {
    int a = x * 2;      // DEAD — a is never used
    int b = x * 3;
    return b;
}

// Optimized:
int compute(int x) {
    return x * 3;
}

DCE is especially powerful combined with other passes — inlining may reveal that a function’s return value is unused, triggering removal of the entire call.

3. Common Subexpression Elimination (CSE)

If the same value is computed twice, the compiler hoists the computation:

int expensive(int x, int y) {
    int a = (x * y) + 42;     // First compute x*y
    int b = (x * y) + 99;     // Same x*y computed again
    return a + b;
}

// After CSE:
int expensive(int x, int y) {
    int t = x * y;            // Compute once
    int a = t + 42;
    int b = t + 99;
    return a + b;
}

4. Loop Invariant Code Motion (LICM)

Computations that produce the same result on every loop iteration are moved outside:

// Before:
for (int i = 0; i < n; i++) {
    arr[i] = arr[i] * sqrt(2.0);  // sqrt(2.0) computed n times!
}

// After LICM:
double factor = sqrt(2.0);        // Computed once
for (int i = 0; i < n; i++) {
    arr[i] = arr[i] * factor;
}

The compiler can only do this when it can prove sqrt(2.0) has no side effects — a category of analysis called aliasing and side-effect analysis.

5. Inlining

Replaces a function call with the function body:

// Before:
int square(int x) { return x * x; }
int result = square(42);

// After inlining:
int result = 42 * 42;

Inlining eliminates call overhead and, critically, enables downstream optimizations — the inlined code may now participate in constant folding, CSE, and DCE globally.

6. Vectorization (SIMD)

The compiler transforms scalar operations to use Single Instruction, Multiple Data (SIMD) instructions — operating on 4, 8, or 16 values simultaneously:

// Simple loop — compiler produces SIMD:
void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

On x86-64 with -O3 -march=native, Clang may generate:

; Using 256-bit YMM registers (AVX2):
vmovups     ymm0, ymmword ptr [rdi + rcx]  ; Load 8 floats from a
vaddps      ymm0, ymm0, ymmword ptr [rsi + rcx]  ; Add 8 floats from b
vmovups     ymmword ptr [rdx + rcx], ymm0  ; Store 8 floats to c
add         rcx, 32                         ; Advance 32 bytes (8 floats)
cmp         rcx, rax
jb          .LBB0_1

The loop processes 8 floats per iteration instead of 1 — a theoretical 8× speedup.

7. Strength Reduction

Replaces expensive operations with cheaper equivalents:

// Multiplication by constant → shift + add:
int x = y * 2;    →   int x = y << 1;    // 1 cycle vs 3 cycles

// Division by constant → multiply + shift:
int x = y / 8;    →   int x = y >> 3;

// Array indexing → pointer arithmetic:
arr[i] = 5;       →   *(arr + i * sizeof(int)) = 5;

In assembly, a multiplication by 2 becomes:

; Source: return x * 2;

; Naive (without strength reduction):
imul    eax, edi, 2      ; 3 cycles, 4 bytes

; After strength reduction:
lea     eax, [rdi + rdi] ; 1 cycle, 3 bytes — no multiply at all!

The LEA (Load Effective Address) instruction is the optimizer’s favorite trick — it can compute a + b*scale + offset without a separate arithmetic instruction.

📊 GCC vs Clang Optimization Levels

Level	Name	What It Enables
`-O0`	None	Fastest compile, no optimization. Debug-friendly.
`-O1`	Optimize	Basic passes: CSE, DCE, constant folding. Good ratio of compile speed to runtime.
`-O2`	Optimize more	All `-O1` + inlining, LICM, alias analysis. Default for most projects.
`-O3`	Optimize most	All `-O2` + aggressive inlining, vectorization, loop unrolling. May increase code size.
`-Os`	Optimize size	`-O2` minus transformations that increase code size. Critical for embedded systems.
`-Oz`	Optimize size aggressively	Even smaller than `-Os`. May use slower but smaller instruction sequences.

Concrete Example: Counting Loop

// sum.c
int sum(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

-O0 (no optimization) — literal translation:

sum:
    mov     dword ptr [rsp-4], 0     ; total = 0
    mov     dword ptr [rsp-8], 0     ; i = 0
    jmp     .Lcond
.Lloop:
    movsxd  rax, dword ptr [rsp-8]
    lea     rcx, [rdi + rax*4]
    add     ecx, dword ptr [rcx]
    mov     dword ptr [rsp-4], ecx
    add     dword ptr [rsp-8], 1     ; i++
.Lcond:
    cmp     dword ptr [rsp-8], esi
    jl      .Lloop
    mov     eax, dword ptr [rsp-4]
    ret

-O3 — Clang produces:

sum:
    test    esi, esi
    jle     .Lempty
    lea     rcx, [rdi + 4*rsi]      ; End pointer
    xor     eax, eax                ; total = 0
.Lloop:
    add     eax, dword ptr [rdi]    ; total += arr[i]
    add     rdi, 4                  ; ptr++ (not i++)
    cmp     rdi, rcx
    jb      .Lloop
    ret
.Lempty:
    xor     eax, eax
    ret

Key differences at -O3:

No stack spills — everything in registers
Pointer arithmetic replaces indexing — rdi walks through memory, no multiply needed
i++ becomes pointer comparison — no separate counter
Loop inverted — test at bottom, one less jump

With `-O3 -march=native` (AVX2 available):

sum:
    test     esi, esi
    jle      .Lempty
    xor      eax, eax
    cmp      esi, 7
    jl       .Lscalar                 ; < 8 elements → scalar
    ...
    ; Vectorized: process 8 ints at a time!
    vpaddd   ymm0, ymm1, ymmword ptr [rdi + 4*rax]
    ...

🔄 The Phase Ordering Problem

Optimization passes interact in complex ways. Running passes in different orders can produce different results:

// Example where pass order matters:
int f(int x) {
    return (x * 2) + (x * 2);
}

// Order A: CSE first, then constant folding
// CSE: t = x * 2; return t + t;
// CF: return x * 4;                ← Better

// Order B: Constant folding first, then CSE
// CF: no change (no constant operands)
// CSE: t = x * 2; return t + t;   ← Missed opportunity

Modern compilers use pass pipelines tuned experimentally — LLVM uses a fixed pipeline with some heuristic reordering, while GCC’s -O2 and -O3 sequences have been refined over 30+ years.

📈 Profile-Guided Optimization (PGO)

PGO uses runtime profiling to inform optimization:

Step 1: Compile with instrumentation
  $ clang -fprofile-instr-generate -O2 program.c -o program

Step 2: Run against representative workloads
  $ ./program < large_input.txt

Step 3: Re-compile with profile data
  $ llvm-profdata merge -output=code.profdata default.profraw
  $ clang -fprofile-instr-use=code.profdata -O2 program.c -o program

The profile tells the compiler:

Which branches are taken (enables better branch layout, block placement)
Which functions are hot (more aggressive inlining)
Which loops iterate many times (more unrolling, vectorization)

Google reported 15–40% speedups across their C++ codebase after deploying PGO with LLVM.

🔗 Link-Time Optimization (LTO)

Traditional compilation optimizes one file at a time — missing opportunities across translation unit boundaries:

// util.c
int add(int a, int b) { return a + b; }

// main.c
#include "util.h"
int main() { return add(2, 3); }

Without LTO, add can only be inlined if the compiler sees both files simultaneously. With LTO (-flto), the compiler stores IR in object files and runs global optimization at link time — enabling cross-file inlining, dead code elimination across the entire program, and whole-program alias analysis.

🎯 Takeaways

Optimization	Typical Speedup	When It Fires
Constant folding	1–5%	Any constant expression
DCE	1–10%	Unused code, dead stores
CSE	5–15%	Repeated subexpressions
LICM	10–50%+	Loop-heavy code
Inlining	10–30%	Small, frequently-called functions
Vectorization	2–8×	Data-parallel loops
PGO	15–40%	Any hot-path code
LTO	5–20%	Multi-file projects

The golden rule: Write clear, idiomatic code and trust the optimizer. Don’t hand-optimize unless profiler evidence demands it — the compiler is better at micro-optimizations than humans, and readable code enables more optimizations because the compiler can reason about it more precisely.

📚 Further Reading

Resource	Link
LLVM Pass Documentation	llvm.org/docs/Passes.html
Compiler Explorer (Godbolt)	godbolt.org
A Tour Through GCC’s Optimization Levels	godbolt.org/z/TTTT
Phase Ordering in LLVM	LLVM Dev Talk
”Optimizing Compilers for Modern Architectures” (Allen & Kennedy)	Textbook, 2001

← Previous
🥷 Zero-Knowledge Proofs

Next →
🦀 Rust’s Borrow Checker

← [b]ack

posts/ 🎓 Things CS Degrees Don't Teach You [n]ext → posts/ 🔧 MCP Workflow Builder — Visual DAG for MCP Tools