π CAP Theorem with a Real Outage Story
CAP theorem defined, why "pick two" is wrong, real outage stories from GitHub and DynamoDB, CP vs AP systems, CRDTs, tunable consistency, and a trade-off decision table for real workloads.
π§ The Most Misunderstood Theorem in Distributed Systems
Every developer has heard βCAP theorem: pick two of Consistency, Availability, Partition Tolerance.β This is wrong β or at least dangerously incomplete.
The CAP theorem (Brewerβs conjecture, proven by Gilbert and Lynch in 2002) actually says:
When a network partition occurs, you must choose between consistency and availability.
Not at design time. At runtime, during a partition. The βpick twoβ framing makes it sound like you choose your trade-off once during architecture design. In reality, you must design for partitions (P is non-negotiable), then decide what happens when they occur.
Letβs look at what CAP actually means, what real outages teach us, and how systems like DynamoDB and Cassandra implement the trade-offs in practice.
π CAP Defined
CAP TRIANGLE:
ββββββββββ
β β
βββββββββ€ CP βββββββββ
β β β β
β ββββββββββ β
βΌ βΌ
ββββββββββ ββββββββββ
β β Partition β β
β CA βββββββββββββββΊβ AP β
β β (not real) β β
ββββββββββ ββββββββββ
| Property | Meaning |
|---|---|
| Consistency | Every read receives the most recent write or an error. All nodes see the same data at the same time (linearizability). |
| Availability | Every request receives a (non-error) response, without guarantee that it contains the most recent write. |
| Partition Tolerance | The system continues to operate despite an arbitrary number of messages being dropped or delayed between nodes. |
Critical insight: In a distributed system, partitions are inevitable β network switches fail, packets are dropped, links degrade. You must tolerate partitions (P). The real question is: during a partition, do you prefer C or A?
Why CA Is a Lie
A βCAβ system (Consistent + Available, no Partition Tolerance) would need a perfectly reliable network β which doesnβt exist. A single-node database is CA by default, but no distributed system can be both C and A when the network splits. If you claim your system is βCA,β it means you havenβt thought about partitions.
π₯ Real Outage #1: GitHub Availability Incident (October 2018)
On October 21, 2018, GitHub experienced its most severe outage in years. A network partition between their US East Coast and US West Coast data centers caused 24 hours of degraded service.
What Happened
GitHub uses MySQL with Orchestrator for automatic failover. The partition:
GitHub's US East DC GitHub's US West DC
ββββββββββββββββββββββ partition ββββββββββββββββββββββ
β MySQL Primary βββββββββββββββββββ MySQL Replica β
β (writable) β β (read-only) β
ββββββββββββββββββββββ ββββββββββββββββββββββ
During the network partition, Orchestrator (which manages MySQL failover) determined that the US West replica could not reach the US East primary. Orchestratorβs automated failover logic promoted the US West replica to primary. But the US East node was still running as primary β it just couldnβt talk to US West.
Result: Two MySQL primaries accepting writes (split-brain).
Since the systems that read from the US East primary (GitHub.com, API, Issues, Pull Requests) continued reading from it, and the systems that read from the newly-promoted US West primary also continued, the two data sets diverged.
User makes a PR comment on github.com
β hits US East β writes to MySQL-1 (old primary)
User makes another comment
β hits US West β writes to MySQL-2 (new primary)
After partition heals:
MySQL-1 has: "comment A"
MySQL-2 has: "comment B"
Replication can't merge these β conflict!
GitHub had to:
- Identify which primary had the authoritative data
- Manually resolve data conflicts
- Rebuild replicas from the authoritative primary
- Accept some data loss (some comments/issues lost)
CAP Analysis
GitHubβs MySQL setup was configured as a CP system β consistent replication with strict ordering. But the automatic failover violated the C guarantee by allowing writes to two primaries. During the partition, GitHub chose availability (keep writing) when their automation ran, but the system was designed for consistency. The mismatch caused the 24-hour outage.
Lesson: If you design for CP, you need to actually refuse writes during a partition. GitHubβs Orchestrator accidentally made the system AP during the outage, with all the conflict-resolution pain that entails.
π₯ Real Outage #2: DynamoDBβs Pounding (AWS re:Invent 2012)
At AWS re:Invent 2012, Netflixβs presentation revealed how DynamoDBβs design choices during partitions affected real users.
The Setup
DynamoDB is built on Dynamo principles (the 2007 Dynamo paper). Itβs an AP system by default: during a partition, DynamoDB prefers to accept writes on both sides and reconcile later.
DynamoDB Ring (simplified):
βββββββ
β N1 β
/ \
βββββββ βββββββ
β N2 β β N3 β
βββββββ βββββββ
\ /
βββββββ
β N4 β
βββββββ
Each DynamoDB table has:
- N (replication factor, default 3)
- R (read quorum size)
- W (write quorum size)
For strong consistency: R + W > N (e.g., R=2, W=2, N=3)
For eventual consistency: W = 1, R = 1
What Happened
During the re:Invent keynote demo, DynamoDBβs request rates for some tables hit unexpected levels. The systemβs partition detection kicked in, and some tables became unavailable for strongly-consistent reads while the partition was being resolved.
Normal: Partition:
Read "key_xyz": Read "key_xyz" (strong):
R β N1, N2 (strong) R β N1 (can't reach N2!)
N1: value β Can't reach R quorum
N2: value β Return error
β Return value
Read "key_xyz" (eventual):
R β N1
N1: value (may be stale)
β Return value
The AP trade-off in action: during a partition, DynamoDB refused strongly-consistent reads (because it couldnβt assemble a full quorum) but continued to accept eventually-consistent reads and all writes.
DynamoDB also offers tunable consistency β you choose per-request:
# Eventually consistent (default β faster, cheaper)
response = table.get_item(Key={'pk': '123'})
# β "EventuallyConsistent" = True (half the read capacity cost)
# Strongly consistent (slower, 2Γ RCU cost)
response = table.get_item(Key={'pk': '123'}, ConsistentRead=True)
# β Returns the latest write or an error
The cost difference is real: strongly-consistent reads consume 2Γ the read capacity units because DynamoDB must contact all nodes in the quorum, not just the fastest replica.
ποΈ CP Systems: PostgreSQL Sync Replication
A classic CP design. PostgreSQL with synchronous replication:
Client writes "x = 42"
β
βΌ
ββββββββββββββββ
β PostgreSQL β WAL flushed to disk β
β Primary β Waiting for replica...
ββββββββ¬ββββββββ
β WAL record
βΌ
ββββββββββββββββ
β PostgreSQL β WAL flushed to disk β
β Replica β Sends ACK to primary
ββββββββ¬ββββββββ
β ACK
βΌ
ββββββββββββββββ
β Primary β Write confirmed to client
β returns OK β
ββββββββββββββββ
During a Partition
Client writes "x = 42"
β
βΌ
ββββββββββββββββ
β PostgreSQL β WAL flushed β
β Primary β Waiting for replica ACK...
ββββββββ¬ββββββββ
β Partition! Packet dropped!
βΌ
ββββββββββββββββ
β β Replica β Unreachable
β β
ββββββββββββββββ
After timeout:
β Primary refuses the write!
β Client gets: "ERROR: could not serialize access"
β Primary is still serving reads, still alive
β But writes are blocked until replica comes back
This is the CP trade-off: you get consistency (if the replica canβt confirm the write, the write doesnβt happen) at the cost of availability (writes fail during the partition).
PostgreSQL also supports quorum sync (PostgreSQL 13+): you specify that G out of N replicas must ACK. If G=2, N=3, you lose one replica but still accept writes. This is a hybrid β youβre trading availability granularity.
π AP Systems: Cassandra
Cassandra is the most prominent AP system. Itβs a Dynamo-style database (same lineage as DynamoDB):
Cassandra Ring:
Each row has a partition key β determines coordinator node
Write "x = 42":
1. Client sends to any node (coordinator)
2. Coordinator writes to all replicas in parallel
3. Responds to client after W nodes acknowledge
Read "x":
1. Coordinator queries R replicas
2. Picks the most recent version (by timestamp)
3. If versions diverge β read repair or hinted handoff
During a Partition
Replicas: N1, N2, N3 (RF=3)
Partition splits the cluster:
Group A (reachable): N1, N2
Group B (isolated): N3
Write "x = 42" with W=2 (consistency level ONE):
β N1, N2 acknowledge β client gets OK
β N3 is missed β but it's fine! W=1 requires 1 node
Later, partition heals:
β N3 has old value for x
β Read repair triggers during the next read
β OR: hinted handoff replays the write to N3
β OR: anti-entropy repair runs periodically
In Cassandra, you choose consistency level per operation:
-- Strongest: QUORUM (R + W > RF)
SELECT * FROM users WHERE id = 123
CONSISTENCY QUORUM;
-- Fastest: ONE (eventual)
SELECT * FROM users WHERE id = 123
CONSISTENCY ONE;
-- Tolerance: ANY (write to coordinator's memory, even if all replicas down)
INSERT INTO users (id, name) VALUES (123, 'Alice')
CONSISTENCY ANY;
| Consistency Level | R / W | Behavior During Partition |
|---|---|---|
ANY | W=any | Write accepted by coordinator β may be lost on coordinator crash |
ONE | W=1 | Write to any single replica β fastest, most available |
LOCAL_QUORUM | R=2, W=2 | Quorum within a single datacenter β ignores cross-DC |
EACH_QUORUM | R=2, W=2 (each DC) | Strong but requires all DCs β unavailable during cross-DC partition |
ALL | R=3, W=3 | Write to all replicas β zero tolerance for failure |
SERIAL | R=quorum + paxos | Linearizable consistency via Paxos β slowest |
𧬠CRDTs: Reconciling Conflicts Automatically
Conflict-free Replicated Data Types (CRDTs) are the mechanism that makes AP systems work without human intervention. They provide automatic conflict resolution based on mathematical properties.
State-based CRDT (CvRDT)
Each node maintains a state that can be merged with any other nodeβs state using a commutative, associative, idempotent merge function:
# A Grow-Only Counter (G-Counter)
class GCounter:
def __init__(self, node_id, num_nodes):
self.node_id = node_id
self.counts = [0] * num_nodes
def increment(self):
self.counts[self.node_id] += 1
def value(self):
return sum(self.counts)
def merge(self, other):
# Element-wise max β commutative, associative, idempotent
for i in range(len(self.counts)):
self.counts[i] = max(self.counts[i], other.counts[i])
# Node A: increment β [1, 0, 0]
# Node B: increment β [0, 1, 0]
# After partition + merge: [1, 1, 0] β value = 2 (correct!)
Operation-based CRDT (CmRDT)
Instead of merging states, nodes broadcast operations. If all operations are commutative, the order doesnβt matter:
# A Grow-Only Set (G-Set)
class GSet:
def __init__(self):
self.elements = set()
def add(self, e):
self.elements.add(e) # Idempotent: adding twice is same as once
def merge(self, other):
self.elements |= other.elements # Union is commutative
Real CRDT Implementations
| System | CRDT Type | Real Usage |
|---|---|---|
| Riak | State-based (vectors) | Riakβs βlast write winsβ is a simple CRDT |
| Redis Enterprise | CRDT sets, counters, maps | Active-Active Redis geo-distributed |
| Automerge (JavaScript) | Multi-Value Registers + Sequences | Collaborative editing (like Google Docs) |
| delta-CRDTs | State-based, but sends diffs | Riak 2.0, NDN (Naspers) |
| SoundCloud | Custom CRDTs | Playlist ordering across devices |
π― Practical Trade-off Decision Table
When would you choose CP vs AP? Hereβs a decision table:
| Use Case | CAP Choice | Why | Example Systems |
|---|---|---|---|
| Banking / Ledger | CP | Cannot lose or duplicate transactions. Refuse writes during partition rather than risk inconsistency. | PostgreSQL sync replication, Spanner |
| DNS | AP | Better to serve a slightly stale IP than return error. The internet itself works this way. | All DNS servers (AP by necessity) |
| Shopping cart | AP | Losing a cart item is worse than briefly seeing a stale cart. CRDTs reconcile smoothly. | DynamoDB, Cassandra |
| User sessions | AP | Stale session data (e.g., showing user as logged out for 1 second) is acceptable. Downtime is not. | Redis Cluster, ElastiCache |
| Stock inventory | CP | Overselling stock due to inconsistent counts costs real money and trust. | MySQL sync replication, PostgreSQL |
| Social feed | AP | Seeing an old post for a few seconds is fine. The site being down is a headline. | Cassandra (used by Instagram?), DynamoDB |
| CI/CD pipeline state | CP | Recording an incorrect βbuild passedβ status erodes trust. Wait for quorum. | PostgreSQL, etcd, Consul |
| Distributed locks / coordination | CP | Linearizability is non-negotiable for locks. An unavailable lock is better than a broken lock. | etcd (Raft), Zookeeper (Zab), Consul |
| Content delivery (CDN) | AP | Serve stale cache during partition. Cannot serve = bad UX. Serving old version > 500 error. | CloudFront, Fastly, CloudFlare |
π§ Key Takeaways
-
βPick twoβ is misleading. You must pick P β partitions are inevitable. The real choice is C vs A during a partition.
-
Tunable consistency (DynamoDB, Cassandra) is the pragmatic middle ground. Choose consistency per operation, not per system.
-
PostgreSQL sync replication is CP: it refuses writes during partition. This is correct for financial data, terrible for social media.
-
Cassandra/DynamoDB are AP: they accept writes during partition and reconcile later. This works for most web workloads.
-
CRDTs make AP systems practical β they provide automatic conflict resolution without human intervention or complex rollback logic.
-
True CA systems donβt exist in distributed settings. If your system is βCA,β you havenβt experienced a partition yet.
-
The real world demands both. Many systems offer tunable consistency so you can be CP for critical operations and AP for everything else.
The CAP theorem doesnβt tell you what to build. It tells you what youβre giving up β so you can make that choice deliberately rather than discovering it during your next outage.