 Command

Pranesh Nikhar's personal site. Vim-style keybinds for navigation; theme + font pickers below.

Theme
 Font Body Code
Reader
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block d / u Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zr> Toggle reader view <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
n Next document b Previous document ⌃O History back ⌃I History forward
 Search
about: Pranesh Nikhar about/more: 🪪 More docs/test: Docs Test ideas: 💡 Ideas more: ➕ More now: Now posts: 📬 Posts projects: 📚 Projects webtui: Style posts/agentic-eda: 📊 AgenticEDA — Automated Exploratory Data Analysis with LangGraph posts/cap-theorem-outage-story: 🌐 CAP Theorem with a Real Outage Story posts/codepilot: ✈️ CodePilot — From Requirements to Deployable FastAPI Backend posts/common-auth-mistakes: 🔐 Common Auth Mistakes Developers Make posts/compiled-vs-jit-vs-interpreted: ⚡ Why Is X Language Fast or Slow? — Compiled vs JIT vs Interpreted posts/cs-degree-gaps: 🎓 Things CS Degrees Don't Teach You posts/cve-2025-breach-analysis: 🛡️ CVE-2025 Breach Analysis — Midnight Blizzard and the 16 Billion Credential Leak posts/fixloop: 🔄 FixLoop — AI Agent Loop for Self-Correcting Code posts/functional-vs-oop: ⚡ Functional vs OOP — Same Problem, Both Ways posts/getman: 🦾 Getman — Declarative API Tester for CLI & TUI posts/how-compilers-optimize: ⚙️ How Compilers Actually Optimize Your Code posts/http3-quic: ⚡ HTTP/3 and QUIC — Why They Matter posts/leetcode-vs-engineering: 🧩 LeetCode vs Real Engineering Skills posts/llm-from-scratch: 🧠 LLM from Scratch — GPT-Style Transformer in PyTorch posts/lsm-trees-bloom-filters: 🌳 LSM Trees & Bloom Filters — Production Deep Dive posts/mcp-workflow-builder: 🔧 MCP Workflow Builder — Visual DAG for MCP Tools posts/persistent-memory: 🧠 Persistent Memory — Long-Term Memory for AI Agents via MCP posts/playcli: 🎬 PlayCLI — Terminal Video Player posts/postgres-mvcc: 🗄️ How PostgreSQL MVCC Works — Multi-Version Concurrency Control Deep Dive posts/raft-consensus: ⛵ Raft Consensus Algorithm Explained posts/rust-borrow-checker: 🦀 Rust Borrow Checker — Catches Real Bugs posts/titan: 🤖 Titan — Terminal AI Coding Agent posts/what-happens-url: 🌐 What Happens Between Typing a URL and Seeing the Page posts/what-happens-when-you-run-a-program: ⚙️ What Actually Happens When You Run a Program posts/zero-knowledge-proofs: 🔐 Zero-Knowledge Proofs Explained Simply webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### ‹style› tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells
 Theme Current: Light j/k or ↑/↓ + Enter

⚙️ How Compilers Actually Optimize Your Code

From C source to x86 assembly — a tour through the compiler pipeline, the key optimization passes, and how to read the output of -O3 with GCC and Clang.

🏗️ The Compiler Pipeline

Your source code goes through four stages before becoming executable machine code:

Source (C/Rust/Go)
  → Frontend (lexing, parsing, semantic analysis)
    → IR (Intermediate Representation)
      → Optimizer (repeated passes over IR)
        → Backend (instruction selection, register allocation)
          → Codegen (machine code / assembly)

The Intermediate Representation (IR) is the crucial abstraction — a language- and architecture-independent form that the optimizer can reason about mathematically. Both GCC (GIMPLE → RTL) and LLVM/Clang (LLVM-IR) use multi-level IRs.

Godbolt Compiler Explorer lets you explore this pipeline interactively: write C or Rust on the left, and see the resulting assembly on the right, annotated with source lines.


📐 SSA Form: The Foundation

Most modern compilers convert IR into Static Single Assignment (SSA) form before optimization. In SSA, every variable is assigned exactly once, and new versions are created with numbered subscripts:

// Original:
x = 1
x = x + 5
y = x * 2

// SSA Form:
x₁ = 1
x₂ = x₁ + 5
y₁ = x₂ * 2

This seemingly simple transformation is the foundation for almost every optimization. With SSA, the def-use chain — the relationship between where a value is defined and where it is used — becomes explicit in the IR itself, eliminating expensive data-flow analysis.

At merge points (if-else, loops), SSA uses φ (phi) functions to select the correct version:

// SSA with a branch:
if (cond):
  x₁ = 1
else:
  x₂ = 2
x₃ = φ(x₁, x₂)  // Picks x₁ or x₂ depending on which path was taken
return x₃

🔧 Key Optimization Passes

1. Constant Folding

The compiler evaluates constant expressions at compile time, not runtime:

// Source:
int x = 60 * 60 * 24;  // seconds in a day

// After constant folding — compiler replaces with:
int x = 86400;

In LLVM-IR:

; Before:
%0 = mul i32 60, 60    ; 60 * 60 = 3600
%1 = mul i32 %0, 24    ; 3600 * 24 = 86400

; After constant folding:
%1 = i32 86400

2. Dead Code Elimination (DCE)

Removes code that computes values never used:

int compute(int x) {
    int a = x * 2;      // DEAD — a is never used
    int b = x * 3;
    return b;
}

// Optimized:
int compute(int x) {
    return x * 3;
}

DCE is especially powerful combined with other passes — inlining may reveal that a function’s return value is unused, triggering removal of the entire call.

3. Common Subexpression Elimination (CSE)

If the same value is computed twice, the compiler hoists the computation:

int expensive(int x, int y) {
    int a = (x * y) + 42;     // First compute x*y
    int b = (x * y) + 99;     // Same x*y computed again
    return a + b;
}

// After CSE:
int expensive(int x, int y) {
    int t = x * y;            // Compute once
    int a = t + 42;
    int b = t + 99;
    return a + b;
}

4. Loop Invariant Code Motion (LICM)

Computations that produce the same result on every loop iteration are moved outside:

// Before:
for (int i = 0; i < n; i++) {
    arr[i] = arr[i] * sqrt(2.0);  // sqrt(2.0) computed n times!
}

// After LICM:
double factor = sqrt(2.0);        // Computed once
for (int i = 0; i < n; i++) {
    arr[i] = arr[i] * factor;
}

The compiler can only do this when it can prove sqrt(2.0) has no side effects — a category of analysis called aliasing and side-effect analysis.

5. Inlining

Replaces a function call with the function body:

// Before:
int square(int x) { return x * x; }
int result = square(42);

// After inlining:
int result = 42 * 42;

Inlining eliminates call overhead and, critically, enables downstream optimizations — the inlined code may now participate in constant folding, CSE, and DCE globally.

6. Vectorization (SIMD)

The compiler transforms scalar operations to use Single Instruction, Multiple Data (SIMD) instructions — operating on 4, 8, or 16 values simultaneously:

// Simple loop — compiler produces SIMD:
void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

On x86-64 with -O3 -march=native, Clang may generate:

; Using 256-bit YMM registers (AVX2):
vmovups     ymm0, ymmword ptr [rdi + rcx]  ; Load 8 floats from a
vaddps      ymm0, ymm0, ymmword ptr [rsi + rcx]  ; Add 8 floats from b
vmovups     ymmword ptr [rdx + rcx], ymm0  ; Store 8 floats to c
add         rcx, 32                         ; Advance 32 bytes (8 floats)
cmp         rcx, rax
jb          .LBB0_1

The loop processes 8 floats per iteration instead of 1 — a theoretical 8× speedup.

7. Strength Reduction

Replaces expensive operations with cheaper equivalents:

// Multiplication by constant → shift + add:
int x = y * 2;int x = y << 1;    // 1 cycle vs 3 cycles

// Division by constant → multiply + shift:
int x = y / 8;int x = y >> 3;

// Array indexing → pointer arithmetic:
arr[i] = 5;*(arr + i * sizeof(int)) = 5;

In assembly, a multiplication by 2 becomes:

; Source: return x * 2;

; Naive (without strength reduction):
imul    eax, edi, 2      ; 3 cycles, 4 bytes

; After strength reduction:
lea     eax, [rdi + rdi] ; 1 cycle, 3 bytes — no multiply at all!

The LEA (Load Effective Address) instruction is the optimizer’s favorite trick — it can compute a + b*scale + offset without a separate arithmetic instruction.


📊 GCC vs Clang Optimization Levels

LevelNameWhat It Enables
-O0NoneFastest compile, no optimization. Debug-friendly.
-O1OptimizeBasic passes: CSE, DCE, constant folding. Good ratio of compile speed to runtime.
-O2Optimize moreAll -O1 + inlining, LICM, alias analysis. Default for most projects.
-O3Optimize mostAll -O2 + aggressive inlining, vectorization, loop unrolling. May increase code size.
-OsOptimize size-O2 minus transformations that increase code size. Critical for embedded systems.
-OzOptimize size aggressivelyEven smaller than -Os. May use slower but smaller instruction sequences.

Concrete Example: Counting Loop

// sum.c
int sum(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

-O0 (no optimization) — literal translation:

sum:
    mov     dword ptr [rsp-4], 0     ; total = 0
    mov     dword ptr [rsp-8], 0     ; i = 0
    jmp     .Lcond
.Lloop:
    movsxd  rax, dword ptr [rsp-8]
    lea     rcx, [rdi + rax*4]
    add     ecx, dword ptr [rcx]
    mov     dword ptr [rsp-4], ecx
    add     dword ptr [rsp-8], 1     ; i++
.Lcond:
    cmp     dword ptr [rsp-8], esi
    jl      .Lloop
    mov     eax, dword ptr [rsp-4]
    ret

-O3 — Clang produces:

sum:
    test    esi, esi
    jle     .Lempty
    lea     rcx, [rdi + 4*rsi]      ; End pointer
    xor     eax, eax                ; total = 0
.Lloop:
    add     eax, dword ptr [rdi]    ; total += arr[i]
    add     rdi, 4                  ; ptr++ (not i++)
    cmp     rdi, rcx
    jb      .Lloop
    ret
.Lempty:
    xor     eax, eax
    ret

Key differences at -O3:

  • No stack spills — everything in registers
  • Pointer arithmetic replaces indexingrdi walks through memory, no multiply needed
  • i++ becomes pointer comparison — no separate counter
  • Loop inverted — test at bottom, one less jump

With -O3 -march=native (AVX2 available):

sum:
    test     esi, esi
    jle      .Lempty
    xor      eax, eax
    cmp      esi, 7
    jl       .Lscalar                 ; < 8 elements → scalar
    ...
    ; Vectorized: process 8 ints at a time!
    vpaddd   ymm0, ymm1, ymmword ptr [rdi + 4*rax]
    ...

🔄 The Phase Ordering Problem

Optimization passes interact in complex ways. Running passes in different orders can produce different results:

// Example where pass order matters:
int f(int x) {
    return (x * 2) + (x * 2);
}

// Order A: CSE first, then constant folding
// CSE: t = x * 2; return t + t;
// CF: return x * 4;                ← Better

// Order B: Constant folding first, then CSE
// CF: no change (no constant operands)
// CSE: t = x * 2; return t + t;   ← Missed opportunity

Modern compilers use pass pipelines tuned experimentally — LLVM uses a fixed pipeline with some heuristic reordering, while GCC’s -O2 and -O3 sequences have been refined over 30+ years.


📈 Profile-Guided Optimization (PGO)

PGO uses runtime profiling to inform optimization:

Step 1: Compile with instrumentation
  $ clang -fprofile-instr-generate -O2 program.c -o program

Step 2: Run against representative workloads
  $ ./program < large_input.txt

Step 3: Re-compile with profile data
  $ llvm-profdata merge -output=code.profdata default.profraw
  $ clang -fprofile-instr-use=code.profdata -O2 program.c -o program

The profile tells the compiler:

  • Which branches are taken (enables better branch layout, block placement)
  • Which functions are hot (more aggressive inlining)
  • Which loops iterate many times (more unrolling, vectorization)

Google reported 15–40% speedups across their C++ codebase after deploying PGO with LLVM.


Traditional compilation optimizes one file at a time — missing opportunities across translation unit boundaries:

// util.c
int add(int a, int b) { return a + b; }

// main.c
#include "util.h"
int main() { return add(2, 3); }

Without LTO, add can only be inlined if the compiler sees both files simultaneously. With LTO (-flto), the compiler stores IR in object files and runs global optimization at link time — enabling cross-file inlining, dead code elimination across the entire program, and whole-program alias analysis.


🎯 Takeaways

OptimizationTypical SpeedupWhen It Fires
Constant folding1–5%Any constant expression
DCE1–10%Unused code, dead stores
CSE5–15%Repeated subexpressions
LICM10–50%+Loop-heavy code
Inlining10–30%Small, frequently-called functions
Vectorization2–8×Data-parallel loops
PGO15–40%Any hot-path code
LTO5–20%Multi-file projects

The golden rule: Write clear, idiomatic code and trust the optimizer. Don’t hand-optimize unless profiler evidence demands it — the compiler is better at micro-optimizations than humans, and readable code enables more optimizations because the compiler can reason about it more precisely.


📚 Further Reading

ResourceLink
LLVM Pass Documentationllvm.org/docs/Passes.html
Compiler Explorer (Godbolt)godbolt.org
A Tour Through GCC’s Optimization Levelsgodbolt.org/z/TTTT
Phase Ordering in LLVMLLVM Dev Talk
”Optimizing Compilers for Modern Architectures” (Allen & Kennedy)Textbook, 2001

📖 Series Navigation

 praneshnikhar.site / posts / how-compilers-optimize · Top 1:1