⚙️ What Actually Happens When You Run a Program

Pranesh Nikhar Jun 27, 2026 12 min read

 󰈤

From ./a.out to exit() — the full journey through ELF loading, fork+exec, virtual memory layout, syscalls, page faults, context switches, and kernel mode.

🎬 The Illusion

Every time you type ./program and hit Enter, the operating system orchestrates an elaborate sequence of hardware and software events to create the illusion that your program is the only thing running on the machine. The reality is far more complex — and far more interesting.

This post traces the complete lifecycle of a process, from the shell parsing your command to the kernel reclaiming memory on exit. We’ll look at Linux specifically, since ELF, load_elf_binary(), and the vDSO are Linux details; macOS follows the same fork/exec shape but uses the Mach-O format and a different kernel.

$ ./hello
Hello, world!
$ _

What really happened between that Enter key and the shell prompt coming back? Let’s find out.

🐚 Stage 1: The Shell Interprets Your Command

You’re sitting in a shell — bash, zsh, or similar. The shell:

Reads the line ./hello from stdin (see below: the read syscall)
Parses it into a command name and arguments
Checks if it’s a builtin (it’s not)
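Sees that ./hello contains a slash, so it skips the $PATH search and uses the path as given (relative to the current directory, e.g. /home/user/hello)
Optionally checks that the file exists and is executable (via access() or stat()); the authoritative check is execve() itself

If the file is missing or lacks execute permission, you get No such file or directory or Permission denied. The command not found message is reserved for bare names that fail the $PATH search. Depending on the shell, the error may only surface after fork(), when execve() fails in the child.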
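
The two decisions the shell makes here can be sketched in a few lines of C. This is a rough illustration, not how bash or zsh is actually implemented; the helper names needs_path_search and looks_executable are made up for this post.

```c
#include <string.h>
#include <unistd.h>

/* A command name containing a '/' (like ./hello) is used as a path as-is.
 * Only bare names (like ls) are looked up in $PATH. */
int needs_path_search(const char *cmd) {
    return strchr(cmd, '/') == NULL;
}

/* Returns 1 if the calling user may execute the file at `path`.
 * The file can change after access() returns, so a real shell still
 * has to handle execve() failing afterwards. */
int looks_executable(const char *path) {
    return access(path, X_OK) == 0;
}
```

For ./hello, needs_path_search() returns 0, so the shell would go straight to the executable check and then to fork().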

🔄 Stage 2: `fork()` — The Process Splits

The shell calls fork(), a syscall that creates an almost-exact copy of the current process:

Before fork:
┌──────────────┐
│   Shell PID   │  single thread, single process
└──────────────┘

After fork:
┌──────────────┐     ┌──────────────┐
│   Shell PID   │     │  Child PID   │
│  (parent)     │     │  (new)       │
└──────────────┘     └──────────────┘

The child gets:

A copy of the parent’s address space (virtual memory, not physical — more on this)
A copy of file descriptors (the child inherits stdin/stdout/stderr)
A new PID
A return value of 0 from fork(); the parent gets the child’s PID instead

This is how the shell distinguishes itself from the child:

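pid_t pid = fork();
if (pid < 0) {
    perror("fork");              // fork failed: no child was created
} else if (pid == 0) {
    // I'm the child: go execute the new program
    execve("./hello", argv, envp);
    perror("execve");            // only reached if execve() failed
    _exit(127);
} else {
    // I'm the parent: wait for the child to finish
    int status;
    waitpid(pid, &status, 0);
}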

The fork is fast because of copy-on-write (COW): the kernel maps the child’s pages to the same physical pages as the parent, marked read-only. Only when either process writes to a page does the kernel actually copy it. A shell that forks and immediately execs does almost no copying at all.
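
Copy-on-write itself is invisible from user space, but its visible effect is easy to check: a write in the child never shows up in the parent. Here is a minimal sketch; the function name child_write_is_private is hypothetical, and the variable is volatile only so the compiler keeps it in memory.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a write made by the child stays invisible to the parent,
 * 0 if not, and -1 if fork() or waitpid() failed. */
int child_write_is_private(void) {
    volatile int value = 1;        /* on a stack page shared after fork() */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 99;                /* the child writes to its own view of the page */
        _exit(value == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* The parent's copy must be untouched by the child's write. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 && value == 1;
}
```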

🚀 Stage 3: `execve()` — ELF Loading

execve() is the syscall that transforms the child process. It replaces the entire address space with a new program loaded from disk.

3.1 The Kernel Opens the ELF File

The kernel reads the first few bytes of ./hello to verify it’s an ELF (Executable and Linkable Format) file:

// Simplified kernel code
struct elf64_hdr *ehdr = (struct elf64_hdr *)buf;
if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
    return -ENOEXEC;  // Not an ELF

A valid ELF header starts with the four bytes \x7fELF.
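
The same check is easy to reproduce in user space. A small sketch (has_elf_magic is a name invented here, not a kernel function):

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with the 4-byte ELF magic: 0x7f 'E' 'L' 'F'. */
int has_elf_magic(const unsigned char *buf, size_t len) {
    /* Two adjacent literals on purpose: in "\x7fELF" the 'E' would be
     * parsed as part of the hex escape. */
    static const char magic[] = "\x7f" "ELF";
    return len >= 4 && memcmp(buf, magic, 4) == 0;
}
```

A shell script starting with #! fails this test, which is why the kernel tries its other binary format handlers for such files.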

3.2 The ELF Layout

The kernel parses the ELF header and program headers to understand how to map the binary into memory:

ELF File on disk:
┌─────────────────────┐
│ ELF Header          │  ← e_entry = 0x401000 (entry point)
├─────────────────────┤
│ Program Headers     │
│  - LOAD .text       │  ← maps to 0x401000, size 8K
│  - LOAD .rodata     │  ← maps after .text
│  - LOAD .data/.bss  │  ← maps to 0x600000; file-backed .data first,
│                     │    then the zero-filled .bss tail (p_memsz > p_filesz)
├─────────────────────┤
│ .text (code)        │
│ .rodata (constants) │
│ .data (init'd)      │
│ .got / .plt         │
└─────────────────────┘

3.3 Setting Up the Address Space

The kernel sets up the process virtual address space like this:

High address (0xffffffff...)
┌─────────────────────────┐
│ Kernel space            │ ← inaccessible from user mode
│ (text, data, modules)   │
├─────────────────────────┤
│ Stack                   │ ← grows downward (local vars, function frames)
│ (argv, envp, auxv)      │
│ ↓↓↓↓                    │
│                         │
│ ↑↑↑↑                    │
│ Heap                    │ ← grows upward (malloc arenas, sbrk)
├─────────────────────────┤
│ .bss                    │ ← zero-initialized global data
│ .data                   │ ← initialized global data
│ .rodata                 │ ← read-only data (string literals)
│ .text                   │ ← machine code (read-only, executable)
├─────────────────────────┤
│ Reserved / low memory   │ ← typically not mapped (NULL page)
└─────────────────────────┘
Low address (0x00000000)

Key details:

.text: The compiled machine code. Mapped read+execute only. Your CPU will fetch instructions from here.
.rodata: Read-only data (string literals, jump tables).
.data: Initialized globals/statics (e.g., int x = 42;).
.bss: Zero-initialized globals (e.g., int arr[1000];). Not stored in the ELF file — the kernel just maps a zeroed page.
Stack: Grows downward. Each function call pushes a stack frame (return address, saved registers, local variables).
Heap: Grows upward. malloc() / sbrk() manage this region.
Kernel space: The top portion of virtual address space is reserved for the kernel. User code cannot access it (segfault if it tries).
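
You can observe this layout from inside a running program by sampling one address per region. A minimal sketch, assuming Linux; struct layout and sample_layout are hypothetical names, and the exact addresses change from run to run because of ASLR.

```c
#include <stdint.h>
#include <stdlib.h>

static int initialized_global = 42;   /* placed in .data */
static int zeroed_global;             /* placed in .bss  */

struct layout {
    uintptr_t text, data, bss, heap, stack;
};

/* Records one sample address from each region of the running process.
 * Returns 0 on success, -1 if the heap allocation failed. */
int sample_layout(struct layout *out) {
    int local = 0;                     /* lives in the current stack frame */
    void *block = malloc(16);          /* small heap allocation */
    if (!block)
        return -1;

    out->text  = (uintptr_t)&sample_layout;
    out->data  = (uintptr_t)&initialized_global;
    out->bss   = (uintptr_t)&zeroed_global;
    out->heap  = (uintptr_t)block;
    out->stack = (uintptr_t)&local;

    free(block);
    return 0;
}
```

Printing the five fields with %#lx from main() should show the stack far above everything else, with code and globals near the bottom.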

3.4 Loading ELF: `load_elf_binary()` in the Kernel

The kernel function load_elf_binary() (in fs/binfmt_elf.c) does the heavy lifting:

// Pseudocode for the kernel ELF loader
for (each program header) {
    if (p_type == PT_LOAD) {
        // Map the segment from the file into memory
        // at virtual address p_vaddr
        elf_map(bprm->file, load_bias + vaddr, elf_ppnt);
    }
}

It loops over the ELF program headers, and for each PT_LOAD segment it calls elf_map(), which creates a new virtual memory area (VMA) through vm_mmap(), the in-kernel counterpart of mmap(). This is a lazy mapping: no disk I/O happens yet. Physical pages are allocated only when the program actually touches them (see page faults below).
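
The same walk over the program headers can be done from user space with the definitions in <elf.h>. This sketch only counts the PT_LOAD entries instead of mapping them; count_load_segments is a made-up helper, and it assumes a 64-bit ELF file with the same byte order as the host.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Counts PT_LOAD program headers in a 64-bit ELF file.
 * Returns -1 if the file cannot be read or is not a 64-bit ELF. */
int count_load_segments(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    Elf64_Ehdr ehdr;
    int count = -1;
    if (fread(&ehdr, sizeof ehdr, 1, f) == 1 &&
        memcmp(ehdr.e_ident, ELFMAG, SELFMAG) == 0 &&
        ehdr.e_ident[EI_CLASS] == ELFCLASS64 &&
        ehdr.e_phentsize == sizeof(Elf64_Phdr)) {
        count = 0;
        for (int i = 0; i < ehdr.e_phnum; i++) {
            Elf64_Phdr phdr;
            long off = (long)(ehdr.e_phoff + (Elf64_Off)i * ehdr.e_phentsize);
            if (fseek(f, off, SEEK_SET) != 0 ||
                fread(&phdr, sizeof phdr, 1, f) != 1) {
                count = -1;
                break;
            }
            if (phdr.p_type == PT_LOAD)
                count++;   /* the kernel would map this segment at p_vaddr */
        }
    }
    fclose(f);
    return count;
}
```

Calling it on /proc/self/exe inspects the binary of the calling process; the result should match the number of LOAD lines printed by readelf -l.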

📜 Stage 4: The Kernel Sets Up the Stack

Before jumping to the entry point, the kernel places argc, argv, envp, and the auxiliary vector (auxv) on the user stack:

Stack (high → low):
┌─────────────────────────┐
│ null terminator         │
│ envp[n] ... envp[0]     │  ← environment strings
│ argv[m] ... argv[0]     │  ← argument strings
│ auxv pairs              │  ← AT_PHDR, AT_ENTRY, AT_PAGESZ, AT_RANDOM...
│ envp[n] ... envp[0] ptrs│
│ argv[m] ... argv[0] ptrs│
│ argc                    │  ← number of arguments
└─────────────────────────┘

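The auxiliary vector is how the kernel passes information to the dynamic linker / runtime: page size (AT_PAGESZ), entry point address (AT_ENTRY), program header address (AT_PHDR), the vDSO address (AT_SYSINFO_EHDR), and a pointer to 16 random bytes (AT_RANDOM) that libc uses to seed things like the stack-protector canary. ASLR itself is applied by the kernel when it picks the mapping addresses.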
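
User code can read auxv entries too. A short sketch using glibc's getauxval() from <sys/auxv.h> (Linux-specific); the two wrapper functions are names chosen for this example.

```c
#include <sys/auxv.h>
#include <unistd.h>

/* Page size as reported by the kernel through the auxiliary vector. */
unsigned long auxv_page_size(void) {
    return getauxval(AT_PAGESZ);
}

/* Returns 1 if the auxv page size agrees with what sysconf() reports. */
int auxv_matches_sysconf(void) {
    long sc = sysconf(_SC_PAGESIZE);
    return sc > 0 && auxv_page_size() == (unsigned long)sc;
}
```

Running any dynamically linked program with LD_SHOW_AUXV=1 set in the environment prints the whole vector at startup.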

🔗 Stage 5: Dynamic Linking

If your binary is dynamically linked (most are), the kernel doesn’t jump to the program’s entry point yet. Instead, it also maps the interpreter, typically /lib64/ld-linux-x86-64.so.2, whose path is stored in the .interp section (the PT_INTERP program header), and starts execution there:

$ readelf -l ./hello | grep INTERP
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
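
What readelf prints here is just the contents of the PT_INTERP segment. The following sketch extracts the same string by hand; read_interp is a hypothetical helper, and it assumes a 64-bit ELF file with host byte order.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Copies the PT_INTERP path of a 64-bit ELF file into out.
 * Returns 0 if found, 1 if there is no PT_INTERP (static binary),
 * and -1 on read errors or if the file is not a 64-bit ELF. */
int read_interp(const char *path, char *out, size_t out_len) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    Elf64_Ehdr eh;
    int result = -1;
    if (out_len > 0 &&
        fread(&eh, sizeof eh, 1, f) == 1 &&
        memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
        eh.e_ident[EI_CLASS] == ELFCLASS64 &&
        eh.e_phentsize == sizeof(Elf64_Phdr)) {
        result = 1;
        for (int i = 0; i < eh.e_phnum; i++) {
            Elf64_Phdr ph;
            long off = (long)(eh.e_phoff + (Elf64_Off)i * sizeof ph);
            if (fseek(f, off, SEEK_SET) != 0 ||
                fread(&ph, sizeof ph, 1, f) != 1) {
                result = -1;
                break;
            }
            if (ph.p_type != PT_INTERP)
                continue;
            /* p_filesz counts the path plus its trailing NUL byte. */
            if (ph.p_filesz == 0 || ph.p_filesz > out_len ||
                fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
                fread(out, 1, ph.p_filesz, f) != ph.p_filesz) {
                result = -1;
            } else {
                out[ph.p_filesz - 1] = '\0';
                result = 0;
            }
            break;
        }
    }
    fclose(f);
    return result;
}
```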

The dynamic linker (ld.so) does:

Load shared libraries: Recursively loads libc.so.6, libpthread.so.0, etc.
Resolve symbols: Maps function calls like printf() to their actual addresses in libc
Apply relocations: Patches the GOT (Global Offset Table) and PLT (Procedure Linkage Table)
Call constructors: Runs __attribute__((constructor)) functions
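Jump to the entry point: Transfers control to _start, which calls __libc_start_main, which calls main()

With lazy binding, a PLT entry is resolved the first time it is called, which is why the first call to printf() takes longer than subsequent calls. Binaries linked with -z now (or run with LD_BIND_NOW set) resolve everything at load time instead.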

🔧 Stage 6: User Mode, Kernel Mode, and the Syscall Path

Now main() runs. When it calls printf("Hello, world!\n"), that eventually calls write(1, buf, len) — a syscall.

6.1 The Two Rings

The CPU has privilege levels:

Ring 3 (User mode): Your program runs here. Restricted: no direct port I/O, no page table changes, no masking of interrupts.
Ring 0 (Kernel mode): The OS runs here. Full access to hardware, memory, and privileged instructions.

To request a kernel service, the program must execute a syscall instruction. (Interrupts and exceptions such as page faults also enter kernel mode, but the program does not choose when those happen.)

6.2 The Syscall Instruction Evolution

x86 Instruction	Introduced	Approx. cost	How it works
`int 0x80`	i386 (1985); used by Linux since 1991	~200+ cycles	Software interrupt, traps to kernel interrupt handler. Slow because it goes through the full interrupt dispatch.
`sysenter`	Pentium II (1997)	~70 cycles	Intel’s fast syscall. Takes the kernel entry point and stack pointer from MSRs, minimal state save.
`syscall`	AMD K6-2 (1998), Intel later	~50 cycles	AMD’s approach: uses MSRs (model-specific registers) for the target RIP and segment selectors, and leaves the stack switch to the kernel. The standard mechanism in 64-bit mode.

On modern x86-64 Linux, syscall is the default. The process:

User mode:
1. Set rax = syscall number (1 = write)
2. Set rdi = fd (1 = stdout)
3. Set rsi = buf pointer
4. Set rdx = count
5. Execute SYSCALL instruction

Kernel mode (entered via SYSCALL):
6. CPU saves return address (RIP) to RCX
7. CPU saves RFLAGS to R11
8. CPU loads new RIP from LSTAR MSR (kernel entry point)
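9. Kernel entry code (entry_64.S) runs swapgs and loads RSP with the kernel stack pointer (SYSCALL itself leaves RSP untouched)
10. Entry code saves the user registers into a pt_regs struct on that stack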
11. System call handler dispatches to sys_write()
12. sys_write() validates arguments, writes to terminal
13. Returns to user mode via SYSRETQ
14. CPU restores RIP from RCX, RFLAGS from R11
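
Steps 1 to 5 translate almost directly into inline assembly. The sketch below is an illustration rather than how libc implements write(): raw_write and raw_write_roundtrip are names invented here, the assembly path is only compiled on x86-64 Linux, and other platforms fall back to libc's generic syscall() helper.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issues write(2) without going through the libc write() wrapper.
 * On x86-64 Linux a failure comes back as a negative errno value. */
long raw_write(int fd, const void *buf, size_t count) {
#if defined(__linux__) && defined(__x86_64__)
    long ret = SYS_write;                        /* rax: syscall number (1) */
    __asm__ volatile ("syscall"
                      : "+a"(ret)                /* rax also carries the result */
                      : "D"((long)fd),           /* rdi: file descriptor */
                        "S"(buf),                /* rsi: buffer pointer  */
                        "d"(count)               /* rdx: byte count      */
                      : "rcx", "r11", "memory"); /* SYSCALL overwrites rcx and r11 */
    return ret;
#else
    return syscall(SYS_write, fd, buf, count);
#endif
}

/* Sends a message through a pipe using raw_write() and reads it back.
 * Returns 1 if the bytes round-trip intact, 0 otherwise. */
int raw_write_roundtrip(void) {
    static const char msg[] = "Hello, world!\n";
    char back[sizeof msg] = {0};
    int fds[2];
    if (pipe(fds) != 0)
        return 0;
    long written = raw_write(fds[1], msg, sizeof msg - 1);
    ssize_t got = read(fds[0], back, sizeof back - 1);
    close(fds[0]);
    close(fds[1]);
    return written == (long)(sizeof msg - 1) &&
           got == written &&
           memcmp(msg, back, sizeof msg - 1) == 0;
}
```

The clobber list mirrors steps 6 and 7: the CPU overwrites RCX and R11 with the return address and flags, so the compiler must not keep live values there.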

6.3 The vDSO — Bypassing the Kernel for Common Operations

Some syscalls are so fast that the overhead of switching to kernel mode dominates. For these, Linux provides the vDSO (virtual Dynamic Shared Object) — a small shared library the kernel maps into every process’s address space.

The vDSO implements:

clock_gettime() — reads the TSC (Time Stamp Counter) directly in user space
gettimeofday() — same idea
getcpu() — reads the current CPU ID

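// Simplified pseudocode of the vDSO fast path; get_vdso_page() is a
// stand-in for how the vDSO locates the kernel-maintained data page
// No syscall needed: just read memory and the hardware counter
const struct vdso_data *vd = get_vdso_page();
u64 cycles = __arch_get_hw_counter(vd->clock_mode, vd);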

The vDSO is why calling System.nanoTime() in Java a million times only takes ~30ms — it’s not really syscalling.

💥 Stage 7: Page Faults and Demand Paging

Remember how the ELF loader mapped the binary lazily? When the CPU tries to execute the first instruction at the entry point, 0x401000:

1. CPU tries to fetch instruction at 0x401000
2. MMU looks up virtual address in page tables
3. Page table entry says "not present" (present bit = 0)
4. MMU triggers a page fault (#PF)
5. CPU saves state, switches to kernel mode
6. Page fault handler runs (arch/x86/mm/fault.c)
7. Kernel determines: "This is a valid VMA (we mapped it in Stage 3)"
8. Kernel allocates a physical page frame (4KB)
9. Kernel reads the relevant page from the ELF file on disk
10. Kernel updates the page table entry (PTE) with the physical page address
11. CPU returns to the faulting instruction — execution continues

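This happens at most once per page (4KB on x86-64). If your program is 100KB of code, that is up to 100KB / 4KB = 25 page faults during startup, and often fewer because the kernel maps neighboring pages that are already cached in the same fault. Each fault costs on the order of a microsecond if the data is in the page cache, or anywhere from ~100µs (SSD) to ~10ms (spinning disk) if it requires a disk read.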

The result: demand paging. The kernel only loads what the program actually touches. If your program has error-handling code that never runs in normal operation, those pages are never loaded from disk.

⚡ Stage 8: Context Switching

While your process runs, other processes (and kernel threads) also need CPU time. A context switch is triggered either involuntarily, when the timer interrupt (typically 100-1000 Hz) finds the time slice used up, or voluntarily, when the process blocks in a syscall or calls sched_yield():

Before context switch (Process A running):
┌──────────────────────┐
│ Process A (user mode) │  ← CPU core 0
│ RIP = 0x4010ab       │
│ RSP = 0x7fff1234     │
│ RAX = 0x42           │
│ CR3 → A's page table │
└──────────────────────┘

Timer interrupt fires:
┌──────────────────────┐
│ Interrupt handler    │  ← saves A's registers to kernel stack
│ (kernel mode)        │
│                      │
│ Call schedule()      │  ← picks next process (maybe B)
│                      │
│ switch_mm(B's CR3)   │  ← switches page tables (TLB flush!)
│                      │
│ switch_to(B)         │  ← restores B's registers from kernel stack
└──────────────────────┘

After context switch (Process B running):
┌──────────────────────┐
│ Process B (user mode) │  ← CPU core 0
│ RIP = 0x555500ab     │
│ RSP = 0x7fff8888     │
│ RAX = 0x7            │
│ CR3 → B's page table │
└──────────────────────┘

A context switch costs:

Direct cost: ~1-5µs (save/restore registers, switch page tables; the TLB is flushed unless PCID tags let entries survive)
Indirect cost: Cold caches (L1/L2/L3, TLB) — this dominates. When A switches back, most of its data has been evicted from cache.

This is why high-performance systems use busy-polling or io_uring instead of one blocking syscall per operation: they cut down on mode switches and on the context switches that blocking causes.
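
The kernel keeps per-process counters for both kinds of switch, and getrusage() exposes them as ru_nvcsw (voluntary) and ru_nivcsw (involuntary). The sketch below, with a made-up function name, shows that a blocking call such as nanosleep() should register as at least one voluntary switch on Linux.

```c
#include <sys/resource.h>
#include <time.h>

/* Sleeps briefly and returns how many voluntary context switches the
 * process recorded in the meantime (-1 on error). */
long voluntary_switches_during_sleep(void) {
    struct rusage before, after;
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 2000000 };  /* 2 ms */

    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    nanosleep(&nap, NULL);      /* blocks, so the scheduler switches away */
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_nvcsw - before.ru_nvcsw;
}
```

The same counters appear as voluntary_ctxt_switches and nonvoluntary_ctxt_switches in /proc/<pid>/status.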

🏁 Stage 9: Exit and Cleanup

Your program calls exit() or returns from main(). The C library flushes stdio buffers, runs atexit() handlers, and then calls the exit_group() syscall:

1. Kernel receives exit_group syscall
2. Calls do_exit() for each thread
3. Releases mm_struct (the address space) — unmaps all VMAs
4. Closes all file descriptors
5. Reclaims physical memory pages
6. Sends SIGCHLD to the parent (the shell)
7. Process enters zombie state (kernel keeps exit code)
8. Shell's waitpid() collects the exit status and reaps the zombie
9. Kernel frees the task_struct and the process table entry as part of that reap
10. Shell prints the next prompt

// What the kernel does in do_exit() — simplified
void do_exit(long code) {
    struct task_struct *tsk = current;  // the task that is exiting
    exit_mm(tsk);       // Tear down address space
    exit_files(tsk);    // Close all file descriptors
    exit_fs(tsk);       // Release working directory references
    exit_signals(tsk);  // Clean up pending signals
    // ...
    schedule();         // Switch to another process — this one is dead
}
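
From the parent's side, the whole exit path is visible only through the status that waitpid() hands back. A minimal sketch of the shell's half of this exchange; run_child_and_reap is a name invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with `code`, reaps it, and returns the exit
 * status the parent observed (-1 on failure or abnormal termination). */
int run_child_and_reap(int code) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                       /* child: exits, stays a zombie until reaped */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)   /* parent: blocks, then reaps the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the low 8 bits of the exit code reach the parent, so a child that exits with 259 is reported as 3. This is the value the shell stores in $?.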

📊 The Full Lifecycle

─────TIME─────►

INIT    READY         RUNNING           WAITING        RUNNING     TERMINATED
 │        │              │                 │              │             │
 │   fork() creates  │                 │              │             │
 │   child, child    │  schedule()     │  read()      │  schedule() │  exit()
 │   is READY        │  picks child    │  from disk   │  resumes    │
 │                   │                  │  blocks      │             │
 └───────────────────┴──────────────────┴─────────────┴─────────────┴────►
                     │                  │              │
                     │  execve() sets   │  I/O done    │
                     │  up address      │  child is    │
                     │  space           │  READY again │

State	Meaning
Init	Process control block (task_struct) allocated
Ready	Loaded, waiting for CPU (in the run queue)
Running	Currently executing on a CPU core
Waiting	Blocked on I/O, a lock, or a timer
Terminated	Exited, waiting for parent to reap (zombie)
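
On Linux these states are visible in /proc: the third field of /proc/<pid>/stat is a single letter such as R (running or runnable), S (sleeping), D (uninterruptible disk wait), or Z (zombie). A small sketch that reads the state of the calling process; own_process_state is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/* Returns the scheduler state letter of the calling process as reported
 * by /proc/self/stat, or '?' if it cannot be read. Linux only. */
char own_process_state(void) {
    char line[1024];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f)
        return '?';
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (!ok)
        return '?';

    /* Format: pid (comm) state ... The comm field may itself contain
     * spaces or ')', so look for the last ')' instead of splitting. */
    char *close_paren = strrchr(line, ')');
    if (!close_paren || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}
```

A process inspecting itself reports R, since it has to be on a CPU to read the file. Pointing the same parsing at another PID is how tools like ps fill in their STAT column.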

🧠 Key Takeaways

fork() is fast because of copy-on-write — it maps pages, doesn’t copy them.
execve() is the loader — it parses ELF headers and builds the virtual address space.
Demand paging means the kernel loads data from disk only when the program touches it.
Syscalls switch to kernel mode via the syscall instruction (~50 cycles on modern CPUs).
The vDSO lets programs read the clock without any syscall.
Context switches are expensive primarily because of cache pollution, not register saving.
A process doesn’t truly “start running” — it transitions through init → ready → running → waiting → ready → … → terminated.

Next time you type ./program, you’ll know exactly what’s happening under the hood.
