 Command

Pranesh Nikhar's personal site. Vim-style keybinds for navigation; theme + font pickers below.

Theme
 Font Body Code
Reader
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block d / u Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zr> Toggle reader view <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
n Next document b Previous document ⌃O History back ⌃I History forward
 Search
about: Pranesh Nikhar about/more: 🪪 More docs/test: Docs Test ideas: 💡 Ideas more: ➕ More now: Now posts: 📬 Posts projects: 📚 Projects webtui: Style posts/agentic-eda: 📊 AgenticEDA — Automated Exploratory Data Analysis with LangGraph posts/cap-theorem-outage-story: 🌐 CAP Theorem with a Real Outage Story posts/codepilot: ✈️ CodePilot — From Requirements to Deployable FastAPI Backend posts/common-auth-mistakes: 🔐 Common Auth Mistakes Developers Make posts/compiled-vs-jit-vs-interpreted: ⚡ Why Is X Language Fast or Slow? — Compiled vs JIT vs Interpreted posts/cs-degree-gaps: 🎓 Things CS Degrees Don't Teach You posts/cve-2025-breach-analysis: 🛡️ CVE-2025 Breach Analysis — Midnight Blizzard and the 16 Billion Credential Leak posts/fixloop: 🔄 FixLoop — AI Agent Loop for Self-Correcting Code posts/functional-vs-oop: ⚡ Functional vs OOP — Same Problem, Both Ways posts/getman: 🦾 Getman — Declarative API Tester for CLI & TUI posts/how-compilers-optimize: ⚙️ How Compilers Actually Optimize Your Code posts/http3-quic: ⚡ HTTP/3 and QUIC — Why They Matter posts/leetcode-vs-engineering: 🧩 LeetCode vs Real Engineering Skills posts/llm-from-scratch: 🧠 LLM from Scratch — GPT-Style Transformer in PyTorch posts/lsm-trees-bloom-filters: 🌳 LSM Trees & Bloom Filters — Production Deep Dive posts/mcp-workflow-builder: 🔧 MCP Workflow Builder — Visual DAG for MCP Tools posts/persistent-memory: 🧠 Persistent Memory — Long-Term Memory for AI Agents via MCP posts/playcli: 🎬 PlayCLI — Terminal Video Player posts/postgres-mvcc: 🗄️ How PostgreSQL MVCC Works — Multi-Version Concurrency Control Deep Dive posts/raft-consensus: ⛵ Raft Consensus Algorithm Explained posts/rust-borrow-checker: 🦀 Rust Borrow Checker — Catches Real Bugs posts/titan: 🤖 Titan — Terminal AI Coding Agent posts/what-happens-url: 🌐 What Happens Between Typing a URL and Seeing the Page posts/what-happens-when-you-run-a-program: ⚙️ What Actually Happens When You Run a Program posts/zero-knowledge-proofs: 🔐 Zero-Knowledge Proofs Explained Simply webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### ‹style› tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells
 Theme Current: Light j/k or ↑/↓ + Enter

⛵ Raft Consensus Algorithm Explained

How Raft solves distributed consensus through leader election, log replication, and safety guarantees — with the bus tour guide analogy, term-based time, and real-world systems like etcd, Consul, and CockroachDB.

The Distributed Consensus Problem

Distributed consensus is deceptively simple to state: given a group of machines, how do you get them to agree on a value, even when some machines fail? In practice, it’s one of the hardest problems in computer science.

Two foundational results frame the difficulty:

The Two Generals Problem: Two armies must coordinate an attack on a fortified city. They can only communicate by sending messengers through enemy territory. A messenger can be captured (message lost). The generals can never be certain their agreement was received — any protocol can be disrupted by the last message being lost. This proves that consensus over an unreliable channel is impossible without some bound on message delay.

The FLP Impossibility Result (Fischer, Lynch, Paterson, 1985): In an asynchronous distributed system where even one process can crash, there is no deterministic algorithm that guarantees consensus in finite time. The paper is only 16 pages and remains one of the most cited results in distributed systems.

So how do systems like etcd, Consul, and CockroachDB achieve consensus? They work around FLP by using timeouts — introducing a weak synchrony assumption. If a leader doesn’t hear from a majority within a timeout, it triggers a new election. This sidesteps the impossibility result because the system is partially synchronous: it behaves asynchronously but assumes bounded message delay during periods of stability.

“FLP says consensus is impossible in a purely asynchronous model. Raft says: add a timeout, and suddenly it’s possible.” — paraphrase of the Raft paper’s motivation

Raft was designed by Diego Ongaro and John Ousterhout at Stanford in 2014, explicitly as a more understandable alternative to Paxos. The Raft paper won the SIGOPS Hall of Fame Award and has been implemented in dozens of production systems.


Raft’s Three Sub-Problems

Raft decomposes consensus into three largely independent sub-problems:

  1. Leader Election: Pick one server to act as the coordinator
  2. Log Replication: The leader accepts client requests and replicates them to followers
  3. Safety: Ensure that if any server has applied a log entry, no other server can apply a different entry at the same log index

This decomposition is what makes Raft teachable. Paxos bundles these concerns together; Raft separates them cleanly.


The Bus Tour Guide Analogy

Imagine a bus full of tourists. The bus has a driver (the leader), and passengers can shout suggestions for where to go next. The driver’s job is to:

  1. Listen for suggestions (client requests)
  2. Announce the next destination over the intercom (AppendEntries RPC)
  3. Wait for a majority of awake passengers to acknowledge (commit rule)
  4. Drive there (apply to state machine)

If the driver falls asleep (crashes), the tourniquet rule kicks in: the passenger who’s been awake the longest (has the most up-to-date log) volunteers to be the new driver. They wait a random amount of time before volunteering to avoid multiple people shouting at once.

If the passengers disagree on which stop was announced last, the one with the longer itinerary wins.


Server States and Leader Election

Every Raft server is in one of three states at any time:

                     ┌─────────────┐
          timeout     │             │     timeout
     ┌───────────────│  FOLLOWER   |───────────────┐
     │               │             │               │
     │               └──────┬──────┘               │
     │                      │                      │
     │              receives │ majority             │
     │               votes  │                      │
     │                      ▼                      │
     │               ┌─────────────┐               │
     │               │             │               │
     └───────────────|  CANDIDATE  │───────────────┘
     │               │             │               │
     │               └──────┬──────┘               │
     │                      │                      │
     │            discovers │ leader or             │
     │            higher    │ term                  │
     │                      ▼                      │
     │               ┌─────────────┐               │
     │               │             │               │
     └───────────────│   LEADER    │───────────────┘
                     │             │      crashes
                     └─────────────┘

Terms are Raft’s version of logical time. Each term begins with an election. If the election succeeds, the term continues with normal operation under one leader. If the election is split (no candidate gets a majority), the term ends immediately and a new term with a higher number begins.

Term 1         Term 2          Term 3           Term 4
│─────│───│────│─────│────│────│─────│────│────│─────│
│  L1 │   │    │  L2 │    │    │ L3  │    │    │     │
│lead │   │    │lead │    │    │lead │    │    │     │
└─────┘   └────┘     └────┘    └─────┘    └────┘     │
  election  split   election   stable     election    │
  succeeds  vote    succeeds   leader     (ongoing)  │

Election timeouts are randomized between 150–300ms. This randomization is crucial: it dramatically reduces the chance of a split vote. If two candidates both time out at the same instant, they both start an election and neither gets a majority. Randomized timeouts make this probability vanishingly small.

A candidate wins an election when it receives votes from a majority of the cluster (itself + N/2 others). It then immediately sends heartbeat AppendEntries to all followers to establish authority.


Log Replication

Once elected, the leader handles all client requests. Each client request is an operation that will be applied to the replicated state machine (e.g., a key-value store SET).

Client Request: SET x = 3

Leader's Log:
index 1: SET x = 1
index 2: SET y = 2
index 3: SET x = 3 ← leader appends to local log
              ↓ AppendEntries RPC
Follower A:  [index 3: SET x = 3]  → acknowledged
Follower B:  [index 3: SET x = 3]  → acknowledged

After majority acknowledges:
Leader marks index 3 as committed
Applies to state machine: x = 3
Responds to client: OK

The AppendEntries RPC carries:

AppendEntries {
  term: currentTerm                // Prevents stale leaders
  leaderId: serverId               // For followers to redirect clients
  prevLogIndex: int                // Index of log entry immediately preceding new ones
  prevLogTerm: int                 // Term of prevLogIndex entry
  entries: []LogEntry              // Log entries to store (empty = heartbeat)
  leaderCommit: int                // Leader's commitIndex
}

The consistency check (prevLogIndex/prevLogTerm) is Raft’s most important mechanism. When a follower receives an AppendEntries, it checks whether its log has an entry at prevLogIndex with matching prevLogTerm. If not, it rejects the request. This is how Raft ensures logs stay consistent.

If a follower rejects an AppendEntries, the leader decrements nextIndex[follower] and retries with older entries. This is called the backward search — the leader walks backward through its log until it finds an entry the follower agrees on.


The Commit Rule

A log entry is committed when the leader has replicated it to a majority of the cluster. Once committed, the leader applies it to its state machine and responds to the client.

But here’s the subtle part: a leader cannot commit entries from previous terms. If a newly elected leader finds uncommitted entries from a previous term in its log, it must keep them uncommitted until it writes an entry from its own term and sees that entry commit.

S1: 1 2 3 4 5     ← leader (term 4)
S2: 1 2 3 4       ← has all entries through index 4
S3: 1 2 3         ← missing index 4

If S3 becomes leader (term 5), it CANNOT commit entry 4
until it writes entry 5 (its own term) and entry 5 commits.

This prevents a situation where a newly elected leader incorrectly commits stale entries that might contradict entries on other servers.


Safety and the Election Restriction

Raft’s safety property is: if a log entry is committed at a given index, no other server can apply a different entry at that index.

The election restriction is the mechanism that guarantees this. A candidate can only become leader if its log is at least as up-to-date as a majority of the cluster. “Up-to-date” is defined by comparing the last log entry:

  1. If the entries have different terms, the entry with the higher term is more up-to-date
  2. If the terms are the same, the longer log is more up-to-date
def is_more_up_to_date(a_last_index, a_last_term, b_last_index, b_last_term):
    if a_last_term != b_last_term:
        return a_last_term > b_last_term
    return a_last_index > b_last_index

When a follower receives a RequestVote RPC, it checks whether the candidate’s log is at least as up-to-date as its own. If not, it denies the vote. This ensures that the leader always has the most complete log, and committed entries are never lost.


Cluster Membership Changes

Changing the set of servers in a Raft cluster (adding or removing nodes) is surprisingly tricky. You can’t just switch configurations atomically — distributed systems have no global clock.

Raft handles this with joint consensus. The transition goes through an intermediate configuration where both the old and new configurations have authority:

C_old → C_old,new → C_new

Phase 1: Leader appends C_old,new to its log and replicates it
Phase 2: C_old,new is committed when a majority of BOTH configurations acknowledge it
Phase 3: Leader appends C_new and replicates it
Phase 4: C_new is committed when a majority of the new config acknowledges it

During joint consensus:

  • Log entries are replicated to all servers in both configurations
  • Any server from either configuration can be leader
  • Safety (majority) requires agreement from both configurations independently

This approach guarantees that the cluster never has two disjoint majorities during the transition.


Real-World Systems Using Raft

SystemUse CaseNotes
etcdDistributed key-value store (Kubernetes’ brain)CoreOS, ~10k ops/s on fast hardware
ConsulService discovery + configurationHashiCorp, uses Raft for its consensus layer
CockroachDBDistributed SQL databaseEach range (shard) runs its own Raft group
TiKVDistributed KV storeUsed by TiDB, written in Rust
MongoDBDocument databaseUses a Raft-based replication protocol
Apache Kafka (KRaft)Event streamingReplaced ZooKeeper with Raft-based controller in 2.8+

etcd is arguably the most prominent example — every Kubernetes cluster uses it to store all cluster state. If etcd’s Raft group loses quorum, the entire Kubernetes cluster stops accepting changes.

CockroachDB runs one Raft group per range (typically 64MB of data). A 100-node cluster might have 10,000+ independent Raft groups running concurrently. This is Raft at massive scale — each represents a tiny consensus domain.


Summary

Raft’s design philosophy is understandability first. By decomposing consensus into leader election, log replication, and safety — and by using randomized timeouts to break symmetry — Raft gives us a protocol that is both provably correct and implementable by a single developer in a weekend.

The tradeoff: Raft requires a stable leader and all requests flow through it. This means the leader can become a bottleneck under high write loads. Systems like CockroachDB mitigate this by sharding data into many independent Raft groups, each with its own leader.

The Raft Paper — Ongaro & Ousterhout, USENIX ATC 2014 Raft Visualization — interactive demo by Ben Johnson etcd Raft Implementation — reference Go implementation


📖 Series Navigation

 praneshnikhar.site / posts / raft-consensus · Top 1:1