⚡ Why Is X Language Fast or Slow? — Compiled vs JIT vs Interpreted

Pranesh Nikhar · Mar 11, 2026 · 10 min read

 󰈤

The spectrum from AOT (C/Rust/Go) → JIT (Java/V8/LuaJIT/PyPy) → interpreted (CPython/Bash), what each tier costs, V8’s 4-tier pipeline, PyPy’s tracing JIT, monomorphization, and why "compiled vs interpreted" is the wrong question.

🧩 The Wrong Question

The most common question from new developers: “Is Python compiled or interpreted?” The answer used to be “interpreted” — but that stopped being useful around 2010.

The real question is:

“What execution tiers does this language’s runtime have?”

Every language runtime sits on a spectrum from pure ahead-of-time (AOT) compilation to pure AST walking. The performance you see depends on where your runtime is on this spectrum, and — crucially — how long it’s been running.

📊 The Execution Spectrum

Ahead-of-Time                    Just-in-Time                     Interpretation
(C, Rust, Go)                    (Java, V8, LuaJIT, PyPy)         (CPython, Bash)

Slow compile ────────────────────────────────────────────────── Fast startup
  Fast exec                                                    Slow exec
        │                          │                                  │
        ▼                          ▼                                  ▼
┌─────────────────┐   ┌──────────────────────────┐   ┌──────────────────────┐
│ .c → compile →  │   │ .java → bytecode → JIT → │   │ .py → AST → eval     │
│ native binary   │   │ native                    │   │ loop (bytecode)      │
└─────────────────┘   └──────────────────────────┘   └──────────────────────┘

Tier	Examples	Startup	Steady-State	Dev Experience
AOT	C, Rust, Go, Zig	Instant	Fastest	Slow compile, fast run
Baseline JIT	V8 Sparkplug, V8 Liftoff (Wasm)	Fast	Fast	Near-instant startup, warms up
Optimizing JIT	Java C2/Graal, V8 TurboFan	Slow	Nearly AOT speed	Warmup required
Tracing JIT	PyPy, LuaJIT trace	Medium	Fast for loops	JIT-unfriendly code hurts
Bytecode VM	CPython, Ruby MRI, PHP	Fast	Slow	Best iteration speed
Tree-walk	Bash, early JS engines	Fastest	Slowest	No compile step

🏛️ AOT: C, Rust, Go

AOT compilers translate your source code directly to machine code before execution. The result is a native binary with no interpreter or JIT compiler in the execution path, although the language may still link in a runtime (as Go does).

C: The Baseline

int sum(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += i;
    return total;
}

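Compiled with clang -O2, this becomes roughly the following (exact instructions vary by compiler version):

sum:
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        shr     rcx
        lea     eax, [rdi + rcx]
        dec     eax
        ret
.LBB0_1:
        xor     eax, eax
        ret

Clang recognized the loop as 0 + 1 + ... + (n-1) = n(n-1)/2 and replaced the entire loop with arithmetic, computed here as (n-1)(n-2)/2 + (n-1) in 64-bit registers so the intermediate product cannot overflow. Zero loop overhead. Zero runtime checks. GCC has historically kept the loop at -O2 (and may vectorize it) rather than deriving the closed form.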
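The closed form is easy to sanity-check. A minimal Python sketch (the function names are illustrative, and Python integers do not wrap at 32 bits the way a C int does) compares the loop from the C example with n(n-1)/2:

```python
def sum_loop(n: int) -> int:
    """Mirror of the C loop: adds 0, 1, ..., n-1."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Closed form for the same sum; 0 for n <= 0, like the C loop."""
    return n * (n - 1) // 2 if n > 0 else 0


# Collect every n in a small test range where the two disagree.
mismatches = [n for n in range(-3, 200) if sum_loop(n) != sum_closed_form(n)]
print(mismatches)
```

An empty list means the loop and the formula agree across the tested range, which is the equivalence the compiler relies on when it deletes the loop.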

Rust: Same AOT, More Safety

Rust compiles through LLVM, same as Clang. The generated code is equivalent in performance. Where Rust differs from C is at compile time: the borrow checker runs static analysis to guarantee memory safety, but this produces zero runtime cost:

fn sum(n: i32) -> i32 {
    (0..n).fold(0, |acc, i| acc + i)
}

With --release, LLVM optimizes this to the same constant-time formula. Rust’s zero-cost abstractions mean high-level constructs like iterators compile down to the same assembly as hand-written loops.

Go: AOT + GC

Go is AOT-compiled but includes a runtime (garbage collector, goroutine scheduler, memory allocator). This is the key difference: Go compiles to native code, but that native code calls into the runtime for memory management.

func sum(n int) int {
    total := 0
    for i := 0; i < n; i++ {
        total += i
    }
    return total
}

Go’s compiler (the standard gc toolchain) does not derive the closed form here; it performs fewer loop optimizations than GCC or LLVM, so the loop runs as written. In this function total and i never escape, so they live in registers or on the stack. Escape analysis moves a variable to the heap only when a reference to it can outlive the function, and that is where GC pressure comes from.

Monomorphization: Rust and C++ generate specialized code for each generic instantiation. Vec<i32> and Vec<String> produce completely different machine code — optimal for each type. Go interfaces use runtime dispatch (similar to virtual methods), which defeats inlining and adds indirect call overhead.

🔥 JIT: From Java 1.0 to V8’s 4 Tiers

JIT compilers start execution quickly and gradually replace hot code paths with increasingly optimized native code.

V8’s 4-Tier Pipeline (JavaScript)

V8 is one of the most heavily engineered JIT pipelines in production. It doesn’t have one execution engine; it has four tiers, an interpreter plus three compilers:

Source Code
    │
    ▼
┌───────────┐
│ Ignition  │  ← Bytecode interpreter (fastest startup)
│           │     Generates bytecode from AST, starts executing immediately
└─────┬─────┘
      │ hot function detected
      ▼
┌───────────┐
│ Sparkplug │  ← Non-optimizing "baseline" compiler
│           │     Generates minimal native code, no optimizations
│           │     Sacrifices code quality for compilation speed
└─────┬─────┘
      │ hotter (+ optimizing compiler available)
      ▼
┌───────────┐
│ Maglev    │  ← Mid-tier optimizing compiler
│           │     Simple optimizations: inlining, constant folding
│           │     Fast compilation, good speed-up (shipped in Chrome 117, 2023)
└─────┬─────┘
      │ hottest functions (thresholds are heuristic)
      ▼
┌───────────┐
│ TurboFan  │  ← Full optimizing compiler ("the big gun")
│           │     Sophisticated: type feedback, escape analysis,
│           │     loop invariant code motion, allocation sinking
│           │     Slow compilation, highest quality code
└───────────┘

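Why four tiers? The JIT has to solve a fundamental trade-off: compile time vs execution time. If a function runs once, spending 10ms to compile it is a net loss. But if it runs 10 million times, spending 100ms to optimize it pays for itself many times over. The figures in the table below are rough, order-of-magnitude illustrations: real tier-up decisions are heuristic and change between V8 releases.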

Tier	Compile Cost	Speed vs Interpreter	When Triggered
Ignition	~0ms (no compile)	1× (baseline)	Always
Sparkplug	~0.1ms	10×	After ~1 call
Maglev	~0.5ms	50×	After ~50 calls
TurboFan	~5-20ms	100-200×	After ~1000 calls
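The trade-off can be made concrete with a toy cost model. The sketch below is not how V8 decides anything: the tier names, compile costs, and per-call costs are hypothetical numbers picked only to show the shape of the break-even curve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    compile_ms: float  # one-time cost to produce code for this tier (hypothetical)
    call_us: float     # cost of one call once the tier is active (hypothetical)


# Hypothetical numbers, chosen only to illustrate the trade-off.
TIERS = [
    Tier("interpreter", compile_ms=0.0, call_us=10.0),
    Tier("baseline", compile_ms=0.1, call_us=5.0),
    Tier("optimizing", compile_ms=10.0, call_us=0.5),
]


def total_ms(tier: Tier, calls: int) -> float:
    """Total time if every call runs in `tier`: compile once, then execute."""
    return tier.compile_ms + calls * tier.call_us / 1000.0


def best_tier(calls: int) -> Tier:
    """Tier with the lowest total time for a function called `calls` times."""
    return min(TIERS, key=lambda t: total_ms(t, calls))


for calls in (1, 100, 10_000, 1_000_000):
    print(f"{calls:>9} calls -> {best_tier(calls).name}")
```

A real engine cannot know the call count in advance, which is why it starts in the cheapest tier and uses counters to promote code only after it has proven hot.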

Type Feedback: The JIT’s Superpower

V8 collects type feedback as it runs:

function add(x, y) { return x + y; }

// First calls: x = int, y = int
//   → V8 records feedback: "add was called with (small int, small int)"
add(3, 4);

// Once add is hot, TurboFan can specialize for that feedback, roughly:
//   check both arguments are small ints; add; check for overflow; return

// Later: x = string, y = string
//   → the small-int check fails, and V8 records (string, string) as well
add("hello", " world");

// The failed check triggers deoptimization: the optimized code is discarded
// and execution resumes in Ignition, which handles every type.
// If add stays hot, V8 can re-optimize it using the wider feedback.

The cost of type polymorphism is dramatic:

// Monomorphic: always the same shape → V8 can optimize heavily
function mono(shape) { return shape.x + shape.y; }
mono({x: 1, y: 2});     // V8 creates hidden class C0
mono({x: 3, y: 4});     // Same hidden class → fast path
// → ~50M ops/sec (illustrative; varies by machine and V8 version)

// Polymorphic: several shapes at one site → every access checks more cases
function poly(shape) { return shape.x + shape.y; }
poly({x: 1, y: 2});     // Hidden class C0
poly({y: 2, x: 1});     // Hidden class C1 (same keys, different order)
poly({x: 1, y: 2, z: 3}); // Hidden class C2 (extra property)
// → ~5M ops/sec (illustrative); past a handful of shapes the site goes
//   megamorphic and falls back to a generic lookup

This is why writing “hot” JavaScript that doesn’t change its object shapes is critical for V8 performance. It is also why TypeScript doesn’t help V8 optimize better: its types are erased at compile time, so the engine sees plain JavaScript and still relies on runtime feedback.
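The way a property-access site drifts from fast to slow can be modeled in a few lines. This is a toy in Python, not V8’s data structure: the class name, the state names, and the limit of four shapes before a site is treated as megamorphic are assumptions made for illustration.

```python
class InlineCache:
    """Toy model of the feedback kept for one property-access site."""

    MAX_POLYMORPHIC = 4  # assumed limit before the site counts as megamorphic

    def __init__(self) -> None:
        self.shapes: list[tuple[str, ...]] = []  # distinct shapes seen so far

    def record(self, obj: dict) -> None:
        # A "shape" here is the ordered tuple of keys; order matters,
        # just as property order matters for hidden classes.
        shape = tuple(obj)
        if shape not in self.shapes:
            self.shapes.append(shape)

    @property
    def state(self) -> str:
        count = len(self.shapes)
        if count == 0:
            return "uninitialized"
        if count == 1:
            return "monomorphic"
        if count <= self.MAX_POLYMORPHIC:
            return "polymorphic"
        return "megamorphic"


site = InlineCache()
site.record({"x": 1, "y": 2})
site.record({"x": 3, "y": 4})   # same keys, same order: still one shape
print(site.state)
site.record({"y": 2, "x": 1})   # same keys, different order: a second shape
print(site.state)
```

The point of the model is that values do not matter, only the layout does: two objects with the same keys in the same order keep the site on its fastest path.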

PyPy’s Tracing JIT

PyPy takes a different approach from V8. Instead of compiling whole functions, it records traces — linear paths through loops:

# This loop runs → PyPy traces it
total = 0
for i in range(1_000_000):
    total += i  # ← trace starts here when loop is detected as hot

The trace records every operation as a linear sequence of low-level operations. A simplified, pseudocode rendering (not literal PyPy output):

loop_begin:
    i3 = get(p47)                  # i = loop variable
    i4 = int_add(i3, 1)            # i + 1
    i5 = int_add(get(p46), i3)     # total + i
    set(p46, i5)                   # store total
    i6 = int_lt(i4, 1000000)       # loop condition
    guard_true(i6)                 # leave the trace when the loop ends
    jump(loop_begin)

The trace is then optimized (loop invariant code motion, constant folding) and compiled to native code. If the program follows the trace, it runs at near-native speed. If a guard fails (e.g., a variable changes type), execution “exits” the trace and falls back to the interpreter.
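The guard-and-fallback behaviour can be imitated in plain Python. This is only a toy with hypothetical function names: real PyPy compiles the trace to machine code, while here both paths are ordinary Python and the "fast path" is fast in name only.

```python
def generic_add(left, right):
    """Stand-in for the interpreter's fully dynamic `+`."""
    return left + right


def sum_with_trace(items):
    """Sum `items`, counting how often the int-only guard fails."""
    total = 0
    side_exits = 0
    for item in items:
        # Guard: the "trace" was recorded for int + int only.
        if type(item) is int and type(total) is int:
            total = total + item              # stays on the trace
        else:
            side_exits += 1                   # guard failed: leave the trace
            total = generic_add(total, item)  # interpreter handles this step
    return total, side_exits


print(sum_with_trace([1, 2, 3]))
print(sum_with_trace([1, 2.5, 3]))
```

In the second call a single float is enough to fail the guard twice: once for the float itself, and once more because the running total is no longer an int. That is the same reason one stray type can keep real traced code off its fast path.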

PyPy’s tracing JIT is excellent for:

Tight numeric loops (often many times faster than CPython for pure Python loops)
Simple data structures traversed linearly

PyPy is poor for:

Highly polymorphic code (many different types entering the same trace)
Short-running scripts (the JIT never warms up)
C extension modules (PyPy’s C API emulation is slow)

CPython’s Copy-and-Patch JIT (3.13+)

CPython 3.13 added an experimental copy-and-patch JIT (PEP 744), and it remains experimental and off by default in 3.14. It’s not a full optimizing JIT like V8’s TurboFan. It’s closer to Sparkplug-level: it generates machine code by copying pre-compiled templates and patching in the operands for each micro-op of a hot code path.

CPython 3.11/3.12: specializing adaptive bytecode interpreter (PEP 659)
  → Hot bytecodes are rewritten to type-specialized variants, no native code

CPython 3.13: experimental JIT, opt-in at build time (--enable-experimental-jit)
  → Some hot code becomes native code, no complex optimizations

CPython 3.14: JIT included in the official macOS and Windows binaries, still off by default (PYTHON_JIT=1 enables it)
  → Speedups are modest and workload-dependent; some code gets slower

It doesn’t make Python “fast” yet: the gains so far are small compared with the gap to C, and the JIT’s main value is as a foundation for later optimizations.
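Whether a given interpreter actually has the JIT is something to check rather than assume. The sketch below relies on sys._jit, a private CPython namespace added in 3.14 that may change or disappear; the jit_status helper and its return strings are made up for this example, and on older versions or other implementations it simply reports "unknown".

```python
import sys


def jit_status() -> str:
    """Describe the JIT state of the running interpreter, if it can be told."""
    jit = getattr(sys, "_jit", None)  # private namespace; absent before 3.14
    if jit is None:
        return "unknown"
    if not jit.is_available():
        return "not built"
    return "enabled" if jit.is_enabled() else "available but disabled"


print(sys.version.split()[0], jit_status())
```

Running it once with and once without PYTHON_JIT=1 in the environment is a quick way to confirm which configuration a benchmark is really measuring.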

🐌 Interpreted: CPython, Bash, Ruby

Pure interpreters read source code (or bytecode) and execute it instruction by instruction. No native code is generated.

CPython’s execution model:

Python source
    │
    ▼
parse → AST → compile → bytecode (.pyc)
                            │
                            ▼
                    bytecode interpreter loop
                    (ceval.c: _PyEval_EvalFrameDefault)
                            │
                    ┌───────┴─────────────────────┐
                    │   opcode loop               │
                    │                             │
                    │ for each op:                │
                    │   switch(op) {              │
                    │     case BINARY_OP: ...     │
                    │     case LOAD_FAST: ...     │
                    │     case CALL: ...          │
                    │   }                         │
                    └─────────────────────────────┘

Each bytecode instruction requires:

Fetch opcode and arguments from the bytecode array
Dispatch via a computed goto or switch statement
Perform the operation (which may involve dynamic type checks)
Store results back in the stack or locals array
Jump to next instruction
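The steps above can be sketched as a toy stack machine. The opcode names are borrowed from CPython for familiarity, but this is a hypothetical mini-VM rather than CPython’s real loop, which is written in C and handles far more cases.

```python
def run(code, local_vars):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0
    while pc < len(code):
        opcode, arg = code[pc]        # fetch opcode and argument
        pc += 1                       # move on to the next instruction
        if opcode == "LOAD_FAST":     # dispatch on the opcode
            stack.append(local_vars[arg])
        elif opcode == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)  # dynamic `+`: types checked at run time
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    raise RuntimeError("fell off the end of the bytecode")


program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, {"a": 2, "b": 3}))
```

Even in this tiny version, adding two numbers costs four trips around the loop, each with a fetch, a comparison chain, and stack traffic.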

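The dispatch overhead alone is roughly 10-20 CPU cycles per bytecode instruction. A single Python a + b might be 3-4 bytecodes, and once type checks, reference counting, and result allocation are included each can cost on the order of 50 cycles. In C, that’s one add instruction (~1 cycle). This is the fundamental reason interpreters are slow: every operation pays a tax that doesn’t exist in compiled code. (Cycle counts here are rough estimates and vary by CPU and CPython version.)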
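The bytecode count for a + b can be inspected directly with the standard dis module. The exact opcode names differ between CPython versions (and other implementations may not expose bytecode this way), so the sketch prints whatever the running interpreter produces rather than assuming a fixed listing.

```python
import dis


def add(a, b):
    return a + b


# Opcode names for the compiled function, in execution order.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```

On recent CPython versions the list is short: loads for the two locals, one binary-operation opcode, and a return, each of which goes through the dispatch loop described above.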

📊 Benchmarks

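From the Computer Language Benchmarks Game (run time normalized to C, lower is better). Treat the figures as approximate: results shift with hardware, compiler and runtime versions, and which submitted program is measured.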

Benchmark	C (gcc -O2)	Rust	Go	Java (GraalVM)	Node.js (V8)	CPython	PyPy	LuaJIT
n-body	1.00×	1.01×	3.5×	1.2×	1.8×	90×	35×	1.3×
fannkuch-redux	1.00×	1.02×	2.8×	1.5×	2.0×	120×	45×	8×
binary-trees	1.00×	1.01×	4.0×	2.0×	3.0×	80×	30×	2.0×
regex-redux	1.00×	0.95×	1.5×	0.8×	0.7×	25×	12×	1.5×
pidigits	1.00×	1.00×	1.2×	1.0×	1.1×	3×	2.5×	1.0×
Geometric Mean	1.00×	1.00×	2.5×	1.3×	1.6×	45×	18×	2.0×
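Ratios like these are summarized with a geometric mean rather than an arithmetic one, so that a single outlier benchmark does not dominate. A small sketch using the Go column from the five rows above (the variable names are illustrative; a summary computed over only these five rows will not necessarily match a summary row computed over a larger benchmark set):

```python
from statistics import geometric_mean

# Go's slowdown ratios, copied from the five benchmark rows above.
go_ratios = [3.5, 2.8, 4.0, 1.5, 1.2]

go_geomean = geometric_mean(go_ratios)
go_arithmetic = sum(go_ratios) / len(go_ratios)

print(f"geometric mean:  {go_geomean:.2f}x")
print(f"arithmetic mean: {go_arithmetic:.2f}x")
```

The geometric mean comes out lower than the arithmetic mean because it weights each benchmark’s ratio multiplicatively, which is the right behaviour when the inputs are themselves ratios.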

Key observations:

Rust = C: Zero-cost abstractions are real. Rust matches C performance within 1-2%.
Go: ~2.5× slower than C, largely due to GC overhead and a compiler that optimizes less aggressively than GCC or LLVM.
Java (GraalVM): Near C speed for compute-heavy workloads. GC adds variance but average throughput is excellent.
Node.js (V8): Impressive for a dynamic language. V8’s 4-tier pipeline is among the most sophisticated JITs in production.
CPython: 45× slower than C. This is the “Python tax” — the cost of dynamic dispatch at every operation.
PyPy: ~2.5× faster than CPython for these numeric benchmarks. Still 18× slower than C.
LuaJIT: Nearly as fast as Java. LuaJIT’s trace-compiler is a masterpiece of JIT engineering (one of Mike Pall’s greatest contributions).

🔬 Warmup Effects

A critical and often overlooked dimension: how long does it take to reach peak performance?

Performance over time (░ = below peak, █ = at that runtime’s own peak):
                     ┌──────────────────────┐
AOT (C/Rust/Go)      │██████████████████████│  peak from instruction 1
                     └──────────────────────┘
                     ┌──────────────────────┐
JIT (Java/V8/LuaJIT) │░░░░░░████████████████│  slow at startup, ramp-up as JIT compiles
                     └──────────────────────┘
                     ┌──────────────────────┐
Interpreted (Python) │██████████████████████│  peak = steady state
                     └──────────────────────┘  (same speed always)

Serverless (AWS Lambda, Cloudflare Workers): JIT-heavy languages suffer because functions are cold-started. Java on Lambda can take 1-2 seconds to warm up. Python starts instantly but runs slow. This is why serverless pushes toward Node.js or Rust-compiled-to-Wasm.
Long-running servers (database, API, stream processing): The JIT’s warmup cost is negligible over hours of uptime. Java is extremely competitive here.
CLI tools: AOT wins. The compile cost is paid once at build time, so every invocation starts at full speed, while an interpreter or JIT pays its startup and warmup cost on each run.

Scenario	Best Runtime	Why
`ls`, `grep`	C/Rust/Go	Startup dominates
Web API endpoint	Java/.NET/Node	Long-lived process, JIT pays off
Data pipeline script	Python/PyPy	Fast to write, PyPy if CPU-bound
Game engine	C++/Rust	Predictable latency, no GC pauses
Shell script	Bash	Run time is dominated by the programs it launches

🎯 The Real Answer

The question “Is X compiled or interpreted?” is not useful. Instead, ask:

What execution tiers does the runtime have? (interpreter → baseline JIT → optimizing JIT)
How long does the program run? (warmup cost amortization)
What’s the tolerance for latency variance? (GC pauses, deoptimization)
How much control do I have over memory layout? (monomorphization, cache locality)

Most managed-language runtimes are converging on the same architecture: start with a fast interpreter, profile hot code, compile incrementally. CPython is getting a JIT. Java uses an interpreter + C1 + C2. V8 uses 4 tiers. Even Ruby’s YJIT (introduced in 3.1) adds a JIT tier on top of the interpreter.

The “compiled vs interpreted” war is over. The answer is: both, in tiers, depending on how hot the code is.
