⛵ Raft Consensus Algorithm Explained
How Raft solves distributed consensus through leader election, log replication, and safety guarantees — with the bus tour guide analogy, term-based time, and real-world systems like etcd, Consul, and CockroachDB.
The Distributed Consensus Problem
Distributed consensus is deceptively simple to state: given a group of machines, how do you get them to agree on a value, even when some machines fail? In practice, it’s one of the hardest problems in computer science.
Two foundational results frame the difficulty:
The Two Generals Problem: Two armies must coordinate an attack on a fortified city. They can only communicate by sending messengers through enemy territory. A messenger can be captured (message lost). The generals can never be certain their agreement was received — any protocol can be disrupted by the last message being lost. This proves that consensus over an unreliable channel is impossible without some bound on message delay.
The FLP Impossibility Result (Fischer, Lynch, Paterson, 1985): In an asynchronous distributed system where even one process can crash, there is no deterministic algorithm that guarantees consensus in finite time. The paper is only 16 pages and remains one of the most cited results in distributed systems.
So how do systems like etcd, Consul, and CockroachDB achieve consensus? They work around FLP by using timeouts — introducing a weak synchrony assumption. If a leader doesn’t hear from a majority within a timeout, it triggers a new election. This sidesteps the impossibility result because the system is partially synchronous: it behaves asynchronously but assumes bounded message delay during periods of stability.
“FLP says consensus is impossible in a purely asynchronous model. Raft says: add a timeout, and suddenly it’s possible.” — paraphrase of the Raft paper’s motivation
Raft was designed by Diego Ongaro and John Ousterhout at Stanford in 2014, explicitly as a more understandable alternative to Paxos. The Raft paper won the SIGOPS Hall of Fame Award and has been implemented in dozens of production systems.
Raft’s Three Sub-Problems
Raft decomposes consensus into three largely independent sub-problems:
- Leader Election: Pick one server to act as the coordinator
- Log Replication: The leader accepts client requests and replicates them to followers
- Safety: Ensure that if any server has applied a log entry, no other server can apply a different entry at the same log index
This decomposition is what makes Raft teachable. Paxos bundles these concerns together; Raft separates them cleanly.
The Bus Tour Guide Analogy
Imagine a bus full of tourists. The bus has a driver (the leader), and passengers can shout suggestions for where to go next. The driver’s job is to:
- Listen for suggestions (client requests)
- Announce the next destination over the intercom (AppendEntries RPC)
- Wait for a majority of awake passengers to acknowledge (commit rule)
- Drive there (apply to state machine)
If the driver falls asleep (crashes), the tourniquet rule kicks in: the passenger who’s been awake the longest (has the most up-to-date log) volunteers to be the new driver. They wait a random amount of time before volunteering to avoid multiple people shouting at once.
If the passengers disagree on which stop was announced last, the one with the longer itinerary wins.
Server States and Leader Election
Every Raft server is in one of three states at any time:
┌─────────────┐
timeout │ │ timeout
┌───────────────│ FOLLOWER |───────────────┐
│ │ │ │
│ └──────┬──────┘ │
│ │ │
│ receives │ majority │
│ votes │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ │ │
└───────────────| CANDIDATE │───────────────┘
│ │ │ │
│ └──────┬──────┘ │
│ │ │
│ discovers │ leader or │
│ higher │ term │
│ ▼ │
│ ┌─────────────┐ │
│ │ │ │
└───────────────│ LEADER │───────────────┘
│ │ crashes
└─────────────┘
Terms are Raft’s version of logical time. Each term begins with an election. If the election succeeds, the term continues with normal operation under one leader. If the election is split (no candidate gets a majority), the term ends immediately and a new term with a higher number begins.
Term 1 Term 2 Term 3 Term 4
│─────│───│────│─────│────│────│─────│────│────│─────│
│ L1 │ │ │ L2 │ │ │ L3 │ │ │ │
│lead │ │ │lead │ │ │lead │ │ │ │
└─────┘ └────┘ └────┘ └─────┘ └────┘ │
election split election stable election │
succeeds vote succeeds leader (ongoing) │
Election timeouts are randomized between 150–300ms. This randomization is crucial: it dramatically reduces the chance of a split vote. If two candidates both time out at the same instant, they both start an election and neither gets a majority. Randomized timeouts make this probability vanishingly small.
A candidate wins an election when it receives votes from a majority of the cluster (itself + N/2 others). It then immediately sends heartbeat AppendEntries to all followers to establish authority.
Log Replication
Once elected, the leader handles all client requests. Each client request is an operation that will be applied to the replicated state machine (e.g., a key-value store SET).
Client Request: SET x = 3
Leader's Log:
index 1: SET x = 1
index 2: SET y = 2
index 3: SET x = 3 ← leader appends to local log
↓ AppendEntries RPC
Follower A: [index 3: SET x = 3] → acknowledged
Follower B: [index 3: SET x = 3] → acknowledged
After majority acknowledges:
Leader marks index 3 as committed
Applies to state machine: x = 3
Responds to client: OK
The AppendEntries RPC carries:
AppendEntries {
term: currentTerm // Prevents stale leaders
leaderId: serverId // For followers to redirect clients
prevLogIndex: int // Index of log entry immediately preceding new ones
prevLogTerm: int // Term of prevLogIndex entry
entries: []LogEntry // Log entries to store (empty = heartbeat)
leaderCommit: int // Leader's commitIndex
}
The consistency check (prevLogIndex/prevLogTerm) is Raft’s most important mechanism. When a follower receives an AppendEntries, it checks whether its log has an entry at prevLogIndex with matching prevLogTerm. If not, it rejects the request. This is how Raft ensures logs stay consistent.
If a follower rejects an AppendEntries, the leader decrements nextIndex[follower] and retries with older entries. This is called the backward search — the leader walks backward through its log until it finds an entry the follower agrees on.
The Commit Rule
A log entry is committed when the leader has replicated it to a majority of the cluster. Once committed, the leader applies it to its state machine and responds to the client.
But here’s the subtle part: a leader cannot commit entries from previous terms. If a newly elected leader finds uncommitted entries from a previous term in its log, it must keep them uncommitted until it writes an entry from its own term and sees that entry commit.
S1: 1 2 3 4 5 ← leader (term 4)
S2: 1 2 3 4 ← has all entries through index 4
S3: 1 2 3 ← missing index 4
If S3 becomes leader (term 5), it CANNOT commit entry 4
until it writes entry 5 (its own term) and entry 5 commits.
This prevents a situation where a newly elected leader incorrectly commits stale entries that might contradict entries on other servers.
Safety and the Election Restriction
Raft’s safety property is: if a log entry is committed at a given index, no other server can apply a different entry at that index.
The election restriction is the mechanism that guarantees this. A candidate can only become leader if its log is at least as up-to-date as a majority of the cluster. “Up-to-date” is defined by comparing the last log entry:
- If the entries have different terms, the entry with the higher term is more up-to-date
- If the terms are the same, the longer log is more up-to-date
def is_more_up_to_date(a_last_index, a_last_term, b_last_index, b_last_term):
if a_last_term != b_last_term:
return a_last_term > b_last_term
return a_last_index > b_last_index
When a follower receives a RequestVote RPC, it checks whether the candidate’s log is at least as up-to-date as its own. If not, it denies the vote. This ensures that the leader always has the most complete log, and committed entries are never lost.
Cluster Membership Changes
Changing the set of servers in a Raft cluster (adding or removing nodes) is surprisingly tricky. You can’t just switch configurations atomically — distributed systems have no global clock.
Raft handles this with joint consensus. The transition goes through an intermediate configuration where both the old and new configurations have authority:
C_old → C_old,new → C_new
Phase 1: Leader appends C_old,new to its log and replicates it
Phase 2: C_old,new is committed when a majority of BOTH configurations acknowledge it
Phase 3: Leader appends C_new and replicates it
Phase 4: C_new is committed when a majority of the new config acknowledges it
During joint consensus:
- Log entries are replicated to all servers in both configurations
- Any server from either configuration can be leader
- Safety (majority) requires agreement from both configurations independently
This approach guarantees that the cluster never has two disjoint majorities during the transition.
Real-World Systems Using Raft
| System | Use Case | Notes |
|---|---|---|
| etcd | Distributed key-value store (Kubernetes’ brain) | CoreOS, ~10k ops/s on fast hardware |
| Consul | Service discovery + configuration | HashiCorp, uses Raft for its consensus layer |
| CockroachDB | Distributed SQL database | Each range (shard) runs its own Raft group |
| TiKV | Distributed KV store | Used by TiDB, written in Rust |
| MongoDB | Document database | Uses a Raft-based replication protocol |
| Apache Kafka (KRaft) | Event streaming | Replaced ZooKeeper with Raft-based controller in 2.8+ |
etcd is arguably the most prominent example — every Kubernetes cluster uses it to store all cluster state. If etcd’s Raft group loses quorum, the entire Kubernetes cluster stops accepting changes.
CockroachDB runs one Raft group per range (typically 64MB of data). A 100-node cluster might have 10,000+ independent Raft groups running concurrently. This is Raft at massive scale — each represents a tiny consensus domain.
Summary
Raft’s design philosophy is understandability first. By decomposing consensus into leader election, log replication, and safety — and by using randomized timeouts to break symmetry — Raft gives us a protocol that is both provably correct and implementable by a single developer in a weekend.
The tradeoff: Raft requires a stable leader and all requests flow through it. This means the leader can become a bottleneck under high write loads. Systems like CockroachDB mitigate this by sharding data into many independent Raft groups, each with its own leader.
The Raft Paper — Ongaro & Ousterhout, USENIX ATC 2014 Raft Visualization — interactive demo by Ben Johnson etcd Raft Implementation — reference Go implementation