⚡ Why Is X Language Fast or Slow? — Compiled vs JIT vs Interpreted
The spectrum from AOT (C/Rust/Go) → JIT (Java/V8/LuaJIT/PyPy) → interpreted (CPython/Bash), what each tier costs, V8’s 4-tier pipeline, PyPy’s tracing JIT, monomorphization, and why "compiled vs interpreted" is the wrong question.
🧩 The Wrong Question
The most common question from new developers: “Is Python compiled or interpreted?” The answer used to be “interpreted” — but that stopped being useful around 2010.
The real question is:
“What execution tiers does this language’s runtime have?”
Every language runtime sits on a spectrum from pure ahead-of-time (AOT) compilation to pure AST walking. The performance you see depends on where your runtime is on this spectrum, and — crucially — how long it’s been running.
📊 The Execution Spectrum
Ahead-of-Time Just-in-Time Interpretation
(C, Rust, Go) (Java, V8, LuaJIT, PyPy) (CPython, Bash)
Slow compile ────────────────────────────────────────────────── Fast startup
Fast exec Slow exec
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────────────┐ ┌──────────────────────┐
│ .c → compile → │ │ .java → bytecode → JIT → │ │ .py → AST → eval │
│ native binary │ │ native │ │ loop (bytecode) │
└─────────────────┘ └──────────────────────────┘ └──────────────────────┘
| Tier | Examples | Startup | Steady-State | Dev Experience |
|---|---|---|---|---|
| AOT | C, Rust, Go, Zig | Instant | Fastest | Slow compile, fast run |
| Baseline JIT | V8 Sparkplug, V8 Liftoff (WebAssembly) | Fast | Fast | Instant startup, warms up |
| Optimizing JIT | Java C2/Graal, V8 TurboFan | Slow | Nearly AOT speed | Warmup required |
| Tracing JIT | PyPy, LuaJIT | Medium | Fast for loops | JIT-unfriendly code hurts |
| Bytecode VM | CPython, Ruby MRI, PHP | Fast | Slow | Best iteration speed |
| Tree-walk | Bash, early JS engines | Fastest | Slowest | No compile step |
🏛️ AOT: C, Rust, Go
AOT compilers translate your source code directly to machine code before execution. The result is a native binary that needs no interpreter or JIT at run time, so there is no warmup and no compilation cost while the program runs.
C: The Baseline
```c
int sum(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += i;
    return total;
}
```
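Compiled with optimizations, this becomes roughly the following (the listing is in the style Clang emits at -O2; other compilers and versions may keep or vectorize the loop instead):
```asm
sum:
        test    edi, edi            ; n <= 0 ?
        jle     .L1
        lea     eax, [rdi-1]        ; eax = n - 1
        lea     ecx, [rdi-2]        ; ecx = n - 2
        imul    rcx, rax            ; rcx = (n-1)(n-2)
        shr     rcx                 ; rcx = (n-1)(n-2) / 2
        lea     eax, [rdi+rcx-1]    ; eax = (n-1)(n-2)/2 + (n-1)
        ret
.L1:
        xor     eax, eax            ; return 0
        ret
```
The compiler recognized that the loop computes 0 + 1 + ... + (n-1) = n(n-1)/2 and replaced the entire loop with arithmetic. The instructions evaluate (n-1)(n-2)/2 + (n-1), which is the same value. Zero loop overhead. Zero runtime checks.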
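The identity behind that rewrite is easy to check. The loop adds 0 through n-1, whose closed form is n(n-1)/2. A minimal Python sketch, with illustrative function names; Python integers do not wrap like a C `int`, so this checks only the arithmetic, not overflow behavior:

```python
def sum_loop(n: int) -> int:
    """Mirror of the C loop: adds 0, 1, ..., n-1."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Closed form n(n-1)/2, with the same n <= 0 early-out as the assembly."""
    return n * (n - 1) // 2 if n > 0 else 0


# The two agree for every n tried, including zero and negative inputs.
assert all(sum_loop(n) == sum_closed_form(n) for n in range(-3, 200))
```

The loop costs O(n) additions while the closed form costs a handful of instructions regardless of n, which is the whole payoff of the optimization.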
Rust: Same AOT, More Safety
Rust compiles through LLVM, same as Clang. The generated code is equivalent in performance. Where Rust differs from C is at compile time: the borrow checker runs static analysis to guarantee memory safety, but this produces zero runtime cost:
```rust
fn sum(n: i32) -> i32 {
    (0..n).fold(0, |acc, i| acc + i)
}
```
With --release, LLVM optimizes this to the same constant-time formula. Rust’s zero-cost abstractions mean high-level constructs like iterators compile down to the same assembly as hand-written loops.
Go: AOT + GC
Go is AOT-compiled but includes a runtime (garbage collector, goroutine scheduler, memory allocator). This is the key difference: Go compiles to native code, but that native code calls into the runtime for memory management.
```go
func sum(n int) int {
    total := 0
    for i := 0; i < n; i++ {
        total += i
    }
    return total
}
```
Go’s compiler favors fast compilation over aggressive optimization and does not rewrite this loop into the closed form, so the loop runs as written. Here total and i never escape, so they stay in registers or on the stack. A variable is moved to the heap, adding GC pressure, only when escape analysis finds that a reference to it may outlive the function.
Monomorphization: Rust and C++ generate specialized code for each generic instantiation. Vec<i32> and Vec<String> produce completely different machine code — optimal for each type. Go interfaces use runtime dispatch (similar to virtual methods), which defeats inlining and adds indirect call overhead.
🔥 JIT: From Java 1.0 to V8’s 4 Tiers
JIT compilers start execution quickly and gradually replace hot code paths with increasingly optimized native code.
V8’s 4-Tier Pipeline (JavaScript)
V8 is a good picture of the state of the art in JIT compilation. It doesn’t have one execution engine; it has four tiers, an interpreter plus three compilers:
```
Source Code
      │
      ▼
┌───────────┐
│ Ignition  │ ← Bytecode interpreter (fastest startup)
│           │   Generates bytecode from the AST, starts executing immediately
└─────┬─────┘
      │ function gets warm
      ▼
┌───────────┐
│ Sparkplug │ ← Baseline compiler (faster than Ignition, non-optimizing)
│           │   Generates minimal native code, no optimizations
│           │   Sacrifices code quality for compilation speed
└─────┬─────┘
      │ hotter
      ▼
┌───────────┐
│ Maglev    │ ← Mid-tier optimizing compiler
│           │   Simple optimizations: inlining, constant folding
│           │   Fast compilation, good speed-up (shipped in Chrome 117, 2023)
└─────┬─────┘
      │ hottest functions (on the order of 1000+ executions)
      ▼
┌───────────┐
│ TurboFan  │ ← Full optimizing compiler ("the big gun")
│           │   Sophisticated: type feedback, escape analysis,
│           │   loop-invariant code motion, allocation sinking
│           │   Slow compilation, highest-quality code
└───────────┘
```
Why four tiers? The JIT has to solve a fundamental trade-off: compile time vs execution time. If a function runs once, spending 10ms to compile it is a net loss. But if it runs 10 million times, spending 100ms to optimize it pays for itself many times over.
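That trade-off reduces to a break-even count: compiling pays off once the per-call saving, summed over all calls, exceeds the one-time compile cost. A small sketch of the arithmetic; the function name and all input numbers are hypothetical, not V8 measurements:

```python
import math


def break_even_calls(compile_ms: float, slow_us: float, fast_us: float) -> int:
    """Smallest call count at which compiling beats staying in the slower tier.

    compile_ms is the one-time compile cost; slow_us and fast_us are the
    per-call times before and after compilation (hypothetical inputs).
    """
    saving_us = slow_us - fast_us
    if saving_us <= 0:
        raise ValueError("compiled code must be faster than the slower tier")
    # Need n * saving_us > compile cost, so take the next integer above the ratio.
    return math.floor(compile_ms * 1000 / saving_us) + 1


# A 100 ms optimization that saves 4 µs per call needs 25,001 calls to pay off.
assert break_even_calls(100, slow_us=5, fast_us=1) == 25001
```

A function called once never reaches the break-even point for any nonzero compile cost, which is why the cheapest tier has to be an interpreter.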
| Tier | Compile Cost (rough) | Speed vs Interpreter (illustrative) | When Triggered (illustrative) |
|---|---|---|---|
| Ignition | ~0ms (no compile) | 1× (baseline) | Always |
| Sparkplug | ~0.1ms | 10× | After ~1 call |
| Maglev | ~0.5ms | 50× | After ~50 calls |
| TurboFan | ~5-20ms | 100-200× | After ~1000 calls |
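The "When Triggered" column can be read as a simple promotion policy. The sketch below models it with fixed call-count thresholds copied from the table. That is a simplification: real engines use heuristic budgets rather than fixed counts, and the names `TIERS` and `tier_for` are made up for illustration.

```python
# (threshold, tier) pairs from the table, highest tier first.
# Fixed call counts are an assumption of this sketch, not V8's real heuristics.
TIERS = [(1000, "TurboFan"), (50, "Maglev"), (1, "Sparkplug"), (0, "Ignition")]


def tier_for(call_count: int) -> str:
    """Return the highest tier whose threshold the call count has reached."""
    for threshold, name in TIERS:
        if call_count >= threshold:
            return name
    raise ValueError("call_count must be non-negative")


# A function climbs the tiers as its call count grows.
history = [tier_for(n) for n in (0, 1, 49, 50, 999, 1000)]
assert history == ["Ignition", "Sparkplug", "Sparkplug", "Maglev", "Maglev", "TurboFan"]
```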
Type Feedback: The JIT’s Superpower
V8 collects type feedback as it runs:
```javascript
function add(x, y) { return x + y; }

// Early calls: x = int, y = int
// → V8 records: "add was called with (int, int)"
add(3, 4);

// Once add is hot, TurboFan speculates on that feedback and emits a
// guarded integer add (type checks, then roughly: add rax, rsi; ret)

// Later: x = string, y = string
// → the guard fails; V8 records: "also called with (string, string)"
add("hello", " world");

// This triggers deoptimization: TurboFan's optimized code is discarded.
// The function falls back to Ignition, which handles the type polymorphism.
```
The cost of type polymorphism is dramatic:
```javascript
// Monomorphic: always the same shape and types → V8 can optimize heavily
function mono(shape) { return shape.x + shape.y; }
mono({x: 1, y: 2});      // V8 creates hidden class C0
mono({x: 3, y: 4});      // Same hidden class → fast path
// → ~50M ops/sec (illustrative)

// Polymorphic: different shapes and types → V8 must handle several cases
function poly(shape) { return shape.x + shape.y; }
poly({x: 1, y: 2});      // Hidden class C0, number addition
poly({x: "a", y: "b"});  // Same keys, but + now sees strings: feedback widens
poly({a: 1, b: 2});      // Different keys → different hidden class; shape.x is undefined
// → ~5M ops/sec (illustrative, 10× slower)
```
This is why writing “hot” JavaScript that doesn’t change its object shapes is critical for V8 performance — and why TypeScript (which doesn’t emit type-guided code) doesn’t help V8 optimize better.
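The shape-tracking idea can be modeled in a few lines. This is a toy sketch, not V8's implementation: it treats an object's "shape" as the ordered tuple of its keys, and the limit of four shapes before a call site is considered megamorphic is an assumption of the sketch.

```python
class InlineCache:
    """Toy per-call-site cache keyed on an object's shape (its key order)."""

    MAX_POLYMORPHIC = 4  # assumed limit for this sketch

    def __init__(self):
        self.shapes = set()

    def observe(self, obj: dict) -> str:
        """Record the shape of obj and return the cache state afterwards."""
        self.shapes.add(tuple(obj))  # tuple(dict) yields keys in insertion order
        if len(self.shapes) == 1:
            return "monomorphic"
        if len(self.shapes) <= self.MAX_POLYMORPHIC:
            return "polymorphic"
        return "megamorphic"


cache = InlineCache()
states = [
    cache.observe({"x": 1, "y": 2}),
    cache.observe({"x": 3, "y": 4}),  # same keys, same order: same shape
    cache.observe({"a": 1, "b": 2}),  # new keys: a second shape
]
assert states == ["monomorphic", "monomorphic", "polymorphic"]
```

The state only ever degrades: once a site has seen a second shape, later calls with the original shape do not make it monomorphic again, which is why one stray object in a hot path can have a lasting cost.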
PyPy’s Tracing JIT
PyPy takes a different approach from V8. Instead of compiling whole functions, it records traces — linear paths through loops:
```python
# This loop runs → PyPy traces it
total = 0
for i in range(1_000_000):
    total += i  # ← trace starts here when loop is detected as hot
```
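The trace records every operation in the loop as a linear sequence of low-level instructions, with a guard wherever the recorded path made an assumption (simplified):
```
label(i, total)                  # loop entry
i1 = int_lt(i, 1000000)          # loop condition
guard_true(i1)                   # exit the trace when the loop ends
total1 = int_add_ovf(total, i)   # total += i
guard_no_overflow()              # exit if the sum no longer fits a machine int
i2 = int_add(i, 1)               # i + 1
jump(i2, total1)                 # back to label with the new values
```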
The trace is then optimized (loop invariant code motion, constant folding) and compiled to native code. If the program follows the trace, it runs at near-native speed. If a guard fails (e.g., a variable changes type), execution “exits” the trace and falls back to the interpreter.
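The guard-and-fallback behavior can be sketched without a real JIT. The code below does not record or compile anything; it only models the control flow, with an int-specialized fast path standing in for the compiled trace and a generic function standing in for the interpreter. All names are hypothetical.

```python
def generic_add(total, x):
    """Slow path: stands in for the interpreter's fully dynamic '+'."""
    return total + x


def run_with_trace(values):
    """Sum values using an int-specialized fast path protected by a guard.

    Returns (total, guard_failures). Each guard failure models one side
    exit from the trace back to the interpreter.
    """
    total = 0
    guard_failures = 0
    for x in values:
        if type(x) is int and type(total) is int:  # guard: types the trace assumed
            total = int.__add__(total, x)          # specialized "trace" body
        else:
            guard_failures += 1                    # side exit
            total = generic_add(total, x)
    return total, guard_failures


assert run_with_trace([1, 2, 3]) == (6, 0)
assert run_with_trace([1, 2.5, 3]) == (6.5, 2)
```

In the second call a single float fails the guard twice: once for the float itself, and again on the next iteration because the running total is now a float. Type changes tend to spread through a loop this way.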
PyPy’s tracing JIT is excellent for:
- Tight numeric loops (pure-Python loops can run many times faster than on CPython)
- Simple data structures traversed linearly
PyPy is poor for:
- Highly polymorphic code (many different types entering the same trace)
- Short-running scripts (the JIT never warms up)
- C extension modules (PyPy’s C API emulation is slow)
CPython’s Experimental JIT (3.13+)
CPython 3.13 added an experimental copy-and-patch JIT, and 3.14 still ships it as experimental and off by default. It’s not a full optimizing JIT like V8’s TurboFan; it’s closer to Sparkplug-level: it generates machine code by copying pre-compiled templates and patching in the operands for each low-level instruction (micro-op).
```
CPython 3.11-3.12: specializing adaptive bytecode interpreter (PEP 659)
  → Hot bytecodes are replaced by type-specialized variants, no native code
CPython 3.13: experimental JIT (build-time option, off by default)
  → Some instructions become native code, no complex optimizations
CPython 3.14: JIT still experimental and off by default
  → Included in the official Windows and macOS builds, enabled with PYTHON_JIT=1
```
It doesn’t make Python “fast”: the speedups reported so far are modest, and the JIT is mainly groundwork for later optimizations.
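Whether a given interpreter includes the JIT, and whether it is switched on, can be checked at run time. The sketch below assumes the `sys._jit` namespace documented for CPython 3.14 (`is_available()` and `is_enabled()`). It is an implementation detail that may change, and it is absent on older versions and other implementations, so the code treats a missing attribute as "no JIT". The helper name `jit_status` is made up.

```python
import sys


def jit_status() -> dict:
    """Report JIT availability, assuming the sys._jit namespace of CPython 3.14."""
    jit = getattr(sys, "_jit", None)  # absent on older or non-CPython interpreters
    if jit is None:
        return {"available": False, "enabled": False}
    return {
        "available": bool(jit.is_available()),
        "enabled": bool(jit.is_enabled()),
    }


print(jit_status())
```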
🐌 Interpreted: CPython, Bash, Ruby
Pure interpreters read source code (or bytecode) and execute it instruction by instruction. No native code is generated.
CPython’s execution model:
```
Python source
      │
      ▼
parse → AST → compile → bytecode (.pyc)
      │
      ▼
bytecode interpreter loop
(ceval.c: _PyEval_EvalFrameDefault)
      │
┌─────┴─────────────────────┐
│ opcode loop               │
│                           │
│ for each op:              │
│   switch (op) {           │
│     case BINARY_OP: ...   │
│     case LOAD_FAST: ...   │
│     case CALL: ...        │
│   }                       │
└───────────────────────────┘
```
Each bytecode instruction requires:
- Fetch opcode and arguments from the bytecode array
- Dispatch via a computed goto or switch statement
- Perform the operation (which may involve dynamic type checks)
- Store results back in the stack or locals array
- Jump to next instruction
The dispatch overhead alone is roughly 10-20 CPU cycles per bytecode instruction, and the operation itself adds dynamic type checks, reference counting, and boxing on top. A single Python a + b might be 3-4 bytecodes, each costing on the order of 50 cycles in total. In C, that’s one add instruction (~1 cycle). This is the fundamental reason interpreters are slow: every operation carries a fixed tax that doesn’t exist in compiled code.
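The bytecodes behind `a + b` can be inspected with the standard `dis` module. Exact opcode names differ between CPython versions (for example `BINARY_ADD` before 3.11 and `BINARY_OP` after, and newer releases fuse some loads), so this sketch prints whatever the running interpreter produces rather than assuming a fixed listing.

```python
import dis


def add(a, b):
    return a + b


# Each entry is one trip through the interpreter's fetch/dispatch/execute loop.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```

On any version the listing contains at least a load of the operands, one binary-operation opcode, and a return, which is the "3-4 bytecodes" the paragraph above refers to.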
📊 Benchmarks
Approximate figures in the style of the Computer Language Benchmarks Game (run time normalized to C, lower is better):
| Benchmark | C (gcc -O2) | Rust | Go | Java (GraalVM) | Node.js (V8) | CPython | PyPy | LuaJIT |
|---|---|---|---|---|---|---|---|---|
| n-body | 1.00× | 1.01× | 3.5× | 1.2× | 1.8× | 90× | 35× | 1.3× |
| fannkuch-redux | 1.00× | 1.02× | 2.8× | 1.5× | 2.0× | 120× | 45× | 8× |
| binary-trees | 1.00× | 1.01× | 4.0× | 2.0× | 3.0× | 80× | 30× | 2.0× |
| regex-redux | 1.00× | 0.95× | 1.5× | 0.8× | 0.7× | 25× | 12× | 1.5× |
| pidigits | 1.00× | 1.00× | 1.2× | 1.0× | 1.1× | 3× | 2.5× | 1.0× |
| Geometric Mean | 1.00× | 1.00× | 2.5× | 1.3× | 1.6× | 45× | 18× | 2.0× |
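Slowdown ratios are summarized with a geometric mean rather than an arithmetic one, because it weights a 2× slowdown and a 2× speedup symmetrically and is not dominated by a single extreme benchmark. A small sketch using the CPython and PyPy columns from the five rows shown; the helper name is illustrative:

```python
from statistics import geometric_mean

# Slowdown relative to C, copied from the five benchmark rows in the table.
cpython = [90, 120, 80, 25, 3]
pypy = [35, 45, 30, 12, 2.5]


def summarize(ratios) -> float:
    """Geometric mean of slowdown ratios, rounded to one decimal place."""
    return round(geometric_mean(ratios), 1)


print(summarize(cpython), summarize(pypy))
```

Computed over only these five rows, the CPython figure comes out somewhat below the table's summary row, so that row may cover benchmarks that are not listed here.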
Key observations:
- Rust = C: Zero-cost abstractions are real. Rust matches C performance within 1-2%.
- Go: ~2.5× slower than C, due to a less aggressive optimizer (Go’s own compiler favors fast builds and does not use LLVM) plus GC overhead on allocation-heavy benchmarks such as binary-trees.
- Java (GraalVM): Near C speed for compute-heavy workloads. GC adds variance but average throughput is excellent.
- Node.js (V8): Impressive for a dynamic language. V8’s 4-tier JIT is among the most sophisticated JITs ever built for one.
- CPython: 45× slower than C. This is the “Python tax” — the cost of dynamic dispatch at every operation.
- PyPy: ~2.5× faster than CPython for these numeric benchmarks. Still 18× slower than C.
- LuaJIT: Nearly as fast as Java. LuaJIT’s trace-compiler is a masterpiece of JIT engineering (one of Mike Pall’s greatest contributions).
🔬 Warmup Effects
A critical and often overlooked dimension: how long does it take to reach peak performance?
```
Performance over time (█ = peak speed, ░ = warming up, ▒ = constant lower speed):

AOT (C/Rust/Go)        │██████████████████████│ peak from instruction 1
JIT (Java/V8/LuaJIT)   │░░░░░░████████████████│ ramp-up as the JIT compiles
Interpreted (CPython)  │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ same speed always, never reaches native peak
```
- Serverless (AWS Lambda, Cloudflare Workers): JIT-heavy languages suffer because functions are cold-started. Java on Lambda can take 1-2 seconds to warm up. Python starts quickly but runs slowly. This is why serverless pushes toward Node.js or Rust compiled to Wasm.
- Long-running servers (database, API, stream processing): The JIT’s warmup cost is negligible over hours of uptime. Java is extremely competitive here.
- CLI tools: AOT wins. The compile cost is paid once at build time, so every run starts at full speed, while a runtime that has to boot a VM pays its startup cost (tens of milliseconds) on every invocation.
| Scenario | Best Runtime | Why |
|---|---|---|
| CLI tools (ls, grep) | C/Rust/Go | Startup dominates |
| Web API endpoint | Java/.NET/Node | Long-lived process, JIT pays off |
| Data pipeline script | Python/PyPy | Fast to write, PyPy if CPU-bound |
| Game engine | C++/Rust | Predictable latency, no GC pauses |
| Shell script | Bash | Run time is dominated by the programs it launches |
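The scenarios in the table come down to how startup cost and steady-state speed trade off over the life of a process. A rough model, lumping all warmup into a single startup figure; the two profiles and their numbers are hypothetical and chosen only to show the crossover:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runtime:
    name: str
    startup_ms: float   # time before useful work starts, warmup included
    ms_per_unit: float  # steady-state cost per unit of work

    def total_ms(self, units: int) -> float:
        return self.startup_ms + units * self.ms_per_unit


# Hypothetical profiles: a warmup-heavy JIT versus a quick-starting interpreter.
RUNTIMES = [
    Runtime("jit", startup_ms=500, ms_per_unit=1.5),
    Runtime("interpreter", startup_ms=20, ms_per_unit=40.0),
]


def fastest(units: int) -> str:
    """Name of the runtime with the lowest total time for this much work."""
    return min(RUNTIMES, key=lambda r: r.total_ms(units)).name


assert fastest(5) == "interpreter"  # short job: startup dominates
assert fastest(1000) == "jit"       # long job: steady-state speed dominates
```

With these made-up numbers the crossover sits at about 13 units of work. The point is only that a crossover exists, so the same two runtimes can each be the right choice for different job lengths.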
🎯 The Real Answer
The question “Is X compiled or interpreted?” is not useful. Instead, ask:
- What execution tiers does the runtime have? (interpreter → baseline JIT → optimizing JIT)
- How long does the program run? (warmup cost amortization)
- What’s the tolerance for latency variance? (GC pauses, deoptimization)
- How much control do I have over memory layout? (monomorphization, cache locality)
Dynamic-language runtimes are converging on the same architecture: start with a fast interpreter, profile hot code, compile incrementally. CPython is getting a JIT. Java uses an interpreter + C1 + C2. V8 uses 4 tiers. Even Ruby’s YJIT (introduced in 3.1) is a baseline JIT.
The “compiled vs interpreted” war is over. The answer is: both, in tiers, depending on how hot the code is.