🌐 CAP Theorem with a Real Outage Story

Pranesh Nikhar · May 4, 2026 · 10 min read

CAP theorem defined, why "pick two" is wrong, real outage stories from GitHub and DynamoDB, CP vs AP systems, CRDTs, tunable consistency, and a trade-off decision table for real workloads.

🧭 The Most Misunderstood Theorem in Distributed Systems

Every developer has heard “CAP theorem: pick two of Consistency, Availability, Partition Tolerance.” This is wrong — or at least dangerously incomplete.

The CAP theorem (Brewer’s conjecture, proven by Gilbert and Lynch in 2002) actually says:

When a network partition occurs, you must choose between consistency and availability.

Not at design time. At runtime, during a partition. The “pick two” framing makes it sound like you choose your trade-off once during architecture design. In reality, you must design for partitions (P is non-negotiable), then decide what happens when they occur.

Let’s look at what CAP actually means, what real outages teach us, and how systems like DynamoDB and Cassandra implement the trade-offs in practice.

📐 CAP Defined

CAP TRIANGLE:
                      ┌────────┐
                      │        │
              ┌───────┤   CP   ├───────┐
              │       │        │       │
              │       └────────┘       │
              ▼                        ▼
        ┌────────┐              ┌────────┐
        │        │   Partition   │        │
        │   CA   │◄────────────►│   AP   │
        │        │   (not real) │        │
        └────────┘              └────────┘

Property	Meaning
Consistency	Every read receives the most recent write or an error. All nodes see the same data at the same time (linearizability).
Availability	Every request receives a (non-error) response, without guarantee that it contains the most recent write.
Partition Tolerance	The system continues to operate despite an arbitrary number of messages being dropped or delayed between nodes.

Critical insight: In a distributed system, partitions are inevitable — network switches fail, packets are dropped, links degrade. You must tolerate partitions (P). The real question is: during a partition, do you prefer C or A?
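
To make that question concrete, here is a minimal sketch of a single replica that has lost contact with its peers. The `Replica` class, its `mode` flag, and the `partitioned` attribute are hypothetical names chosen for illustration; they do not correspond to any real database's API.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when this replica cannot reach its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot prove the local value is the latest write: return an error
            raise Unavailable("cannot confirm latest value during partition")
        # AP (or no partition): answer from local state, which may be stale
        return self.value
```

The only difference between the two modes is what happens inside the `if`: the CP replica gives up availability, the AP replica gives up the freshness guarantee.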

Why CA Is a Lie

A “CA” system (Consistent + Available, no Partition Tolerance) would need a perfectly reliable network — which doesn’t exist. A single-node database is CA by default, but no distributed system can be both C and A when the network splits. If you claim your system is “CA,” it means you haven’t thought about partitions.

💥 Real Outage #1: GitHub Availability Incident (October 2018)

On October 21, 2018, GitHub experienced its most severe outage in years. A 43-second network partition that cut its primary US East Coast data center off from the rest of its network, including the US West Coast data center, led to roughly 24 hours of degraded service.

What Happened

GitHub uses MySQL with Orchestrator for automatic failover. The partition:

GitHub's US East DC                     GitHub's US West DC
┌────────────────────┐    partition     ┌────────────────────┐
│  MySQL Primary     │───────✗─────────│  MySQL Replica     │
│  (writable)        │                  │  (read-only)       │
└────────────────────┘                  └────────────────────┘

During the network partition, Orchestrator (which manages MySQL failover) determined that the US West replica could not reach the US East primary. Orchestrator’s automated failover logic promoted the US West replica to primary. But the US East node was still running as primary — it just couldn’t talk to US West.

Result: Two MySQL primaries accepting writes (split-brain).

Because applications kept writing to whichever primary they could reach, each side accumulated writes the other never saw, and the two data sets diverged.

User makes a PR comment on github.com
    → hits US East → writes to MySQL-1 (old primary)
User makes another comment
    → hits US West → writes to MySQL-2 (new primary)
After partition heals:
MySQL-1 has: "comment A"
MySQL-2 has: "comment B"
Replication can't merge these — conflict!

GitHub had to:

Identify which primary had the authoritative data
Manually resolve data conflicts
Rebuild replicas from the authoritative primary
Set aside the small number of unreplicated writes for later manual reconciliation (GitHub reported no permanent loss of user data)

CAP Analysis

GitHub’s MySQL topology was designed to behave like a CP system: a single writable primary with ordered replication. The automatic failover broke the C guarantee by letting two sites accept writes. In effect, the automation chose availability (keep writing) during the partition while the rest of the system assumed consistency. Untangling that mismatch is what stretched the incident to roughly 24 hours.

Lesson: If you design for CP, you need to actually refuse writes during a partition. GitHub’s Orchestrator accidentally made the system AP during the outage, with all the conflict-resolution pain that entails.
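
One common way to "actually refuse writes" is a majority rule: a node may act as primary only while it can see a strict majority of the cluster. The sketch below is a generic illustration of that rule, not a description of how Orchestrator or GitHub's tooling works; `may_accept_writes` and its parameters are hypothetical names.

```python
def may_accept_writes(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    Two disjoint groups cannot both hold a strict majority, so at most one
    side of a partition keeps accepting writes.
    """
    return (reachable_peers + 1) > cluster_size // 2


# Three-node cluster split 2 / 1
majority_side = may_accept_writes(reachable_peers=1, cluster_size=3)
isolated_side = may_accept_writes(reachable_peers=0, cluster_size=3)
```

Note that with only two sites, neither side has a majority once they are cut off from each other, which is why such setups usually add a third vote somewhere else.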

💥 Real Outage #2: DynamoDB’s Pounding (AWS re:Invent 2012)

At AWS re:Invent 2012, Netflix’s presentation revealed how DynamoDB’s design choices during partitions affected real users.

The Setup

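DynamoDB takes its name and many ideas from the 2007 Dynamo paper, though its internals differ: each partition has a leader replica that coordinates writes. Reads are eventually consistent by default, which is the AP-leaning choice: any replica may answer, even one that is slightly behind.

Dynamo-style ring (simplified):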
                    ┌─────┐
                    │ N1  │
                    /     \
            ┌─────┐         ┌─────┐
            │ N2  │         │ N3  │
            └─────┘         └─────┘
                \             /
                    ┌─────┐
                    │ N4  │
                    └─────┘

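A Dynamo-style store is described by three numbers (DynamoDB manages replication internally and does not expose them; Cassandra and Riak do):

N (replication factor, typically 3)
R (read quorum size)
W (write quorum size)

For strong consistency: R + W > N (e.g., R=2, W=2, N=3)
For eventual consistency: W = 1, R = 1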
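
Why does R + W > N matter? Because it forces every read quorum to share at least one node with every write quorum, so a read always touches a replica that saw the latest acknowledged write. The brute-force check below is a small sketch of that argument; `quorums_always_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Check every possible read set against every possible write set."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


strong = quorums_always_overlap(n=3, r=2, w=2)    # R + W > N
eventual = quorums_always_overlap(n=3, r=1, w=1)  # R + W <= N
```

With N=3, R=2, W=2 every pair of quorums intersects; with R=1, W=1 a read can land on a replica the write never reached.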

What Happened

During the re:Invent keynote demo, DynamoDB’s request rates for some tables hit unexpected levels. The system’s partition detection kicked in, and some tables became unavailable for strongly-consistent reads while the partition was being resolved.

Normal:                        Partition:
Read "key_xyz":                Read "key_xyz" (strong):
  R → N1, N2 (strong)             R → N1 (can't reach N2!)
  N1: value                      ✗ Can't reach R quorum
  N2: value                       → Return error
  → Return value                 
                                 
                                 Read "key_xyz" (eventual):
                                   R → N1
                                   N1: value (may be stale)
                                   → Return value

The AP trade-off in action: during a partition, DynamoDB refused strongly-consistent reads (because it couldn’t assemble a full quorum) but continued to accept eventually-consistent reads and all writes.

DynamoDB also offers tunable consistency — you choose per-request:

import boto3

# Hypothetical table name, used only for illustration
table = boto3.resource('dynamodb').Table('example-table')

# Eventually consistent (default: faster, half the read capacity cost)
response = table.get_item(Key={'pk': '123'})
# → May return a slightly stale item

# Strongly consistent (slower, 2× RCU cost)
response = table.get_item(Key={'pk': '123'}, ConsistentRead=True)
# → Returns the latest write or an error

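The cost difference is real: a strongly consistent read of an item up to 4 KB consumes one read capacity unit, while an eventually consistent read consumes half of one. A strongly consistent read also has to be answered by the partition’s leader replica, so it cannot simply be served by whichever replica is closest or fastest.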
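
The 2× difference is easy to put into numbers. The sketch below assumes DynamoDB's documented read unit of 4 KB (taken here as 4096 bytes) with item sizes rounded up to the next unit; `read_capacity_units` is a hypothetical helper, not part of boto3.

```python
import math

READ_UNIT_BYTES = 4096  # assumption: one read unit covers up to 4 KB


def read_capacity_units(item_size_bytes: int, strongly_consistent: bool) -> float:
    """Estimate RCUs consumed by a single-item read."""
    units = max(1, math.ceil(item_size_bytes / READ_UNIT_BYTES))
    # Eventually consistent reads are billed at half the strongly consistent rate
    return float(units) if strongly_consistent else units / 2


small_strong = read_capacity_units(1_000, strongly_consistent=True)
small_eventual = read_capacity_units(1_000, strongly_consistent=False)
```

At high read volume that factor of two goes straight into provisioned capacity, which is a good reason to reserve `ConsistentRead=True` for the reads that really need it.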

🏗️ CP Systems: PostgreSQL Sync Replication

A classic CP design. PostgreSQL with synchronous replication:

Client writes "x = 42"
    │
    ▼
┌──────────────┐
│ PostgreSQL   │  WAL flushed to disk ✓
│ Primary      │  Waiting for replica...
└──────┬───────┘
       │ WAL record
       ▼
┌──────────────┐
│ PostgreSQL   │  WAL flushed to disk ✓
│ Replica      │  Sends ACK to primary
└──────┬───────┘
       │ ACK
       ▼
┌──────────────┐
│ Primary      │  Write confirmed to client
│ returns OK   │
└──────────────┘

During a Partition

Client writes "x = 42"
    │
    ▼
┌──────────────┐
│ PostgreSQL   │  WAL flushed ✓
│ Primary      │  Waiting for replica ACK...
└──────┬───────┘
       │ Partition! Packet dropped!
       ▼
┌──────────────┐
│ ✗ Replica    │  Unreachable
│              │
└──────────────┘
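With the replica unreachable:
→ Primary does not reject the write; COMMIT simply blocks waiting for the ACK
→ Client sees the write hang until the wait is cancelled or its own timeout fires
→ Primary is still serving reads, still alive
→ But no write is acknowledged until the replica comes back or an operator reconfigures replication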

This is the CP trade-off: you get consistency (a write is never acknowledged unless the replica has confirmed it) at the cost of availability (writes stall during the partition).

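PostgreSQL also supports quorum-based synchronous replication (PostgreSQL 10+): you specify that any G out of N listed standbys must ACK. With G=2 and N=3, you can lose one standby and still accept writes. This is a hybrid: the system stays available through a bounded number of failures and only then gives up availability to protect consistency.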
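
In PostgreSQL the quorum is configured with a setting along the lines of `synchronous_standby_names = 'ANY 2 (s1, s2, s3)'`, where `s1` to `s3` are placeholder standby names. The Python sketch below only models the decision rule (enough ACKs or keep waiting); `commit_acknowledged` is a hypothetical function, not PostgreSQL code.

```python
from typing import Dict


def commit_acknowledged(acks: Dict[str, bool], required: int) -> bool:
    """Return True once at least `required` standbys have confirmed the WAL flush."""
    confirmed = sum(1 for acked in acks.values() if acked)
    return confirmed >= required


# G=2 of N=3: one standby is behind a partition, the commit still completes
one_standby_lost = commit_acknowledged({"s1": True, "s2": True, "s3": False}, required=2)

# Two standbys lost: the commit keeps waiting
two_standbys_lost = commit_acknowledged({"s1": True, "s2": False, "s3": False}, required=2)
```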

🌊 AP Systems: Cassandra

Cassandra is the most prominent AP system. It’s a Dynamo-style database (same lineage as DynamoDB):

Cassandra Ring:
Each row has a partition key → determines coordinator node

Write "x = 42":
1. Client sends to any node (coordinator)
2. Coordinator writes to all replicas in parallel
3. Responds to client after W nodes acknowledge

Read "x": 
1. Coordinator queries R replicas
2. Picks the most recent version (by timestamp)
3. If versions diverge → read repair pushes the newest version to the stale replicas
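
The read path above can be sketched in a few lines. This is a simplified model, assuming each replica is a plain dictionary and every value carries a write timestamp; `Versioned` and `read_with_repair` are hypothetical names, and real Cassandra compares timestamps per column rather than per row.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # write time; the highest timestamp wins


def read_with_repair(replicas: List[Dict[str, Versioned]], key: str) -> Optional[str]:
    """Query the given replicas, return the newest value, and repair stale copies."""
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda v: v.timestamp)
    for replica in replicas:
        current = replica.get(key)
        if current is None or current.timestamp < newest.timestamp:
            replica[key] = newest  # read repair: overwrite the stale or missing copy
    return newest.value
```

Last-write-wins by timestamp is simple, but it silently discards the older of two concurrent writes, which is exactly the kind of conflict CRDTs are designed to avoid.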

During a Partition

Replicas: N1, N2, N3 (RF=3)
Partition splits the cluster:

Group A (reachable): N1, N2
Group B (isolated):  N3

Write "x = 42" with W=1 (consistency level ONE):
→ N1, N2 acknowledge → client gets OK
→ N3 is missed → but it's fine! W=1 requires only 1 acknowledgement

Later, partition heals:
→ N3 has old value for x
→ Read repair triggers during the next read
→ OR: hinted handoff replays the write to N3
→ OR: anti-entropy repair runs periodically
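
Hinted handoff is the easiest of these three mechanisms to sketch. The model below assumes a coordinator that remembers which writes a replica missed and replays them once the replica is reachable again; the `Coordinator` class and its attributes are hypothetical, and timestamps and hint expiry are left out for brevity.

```python
from typing import Dict, List, Set, Tuple


class Coordinator:
    def __init__(self, replicas: Dict[str, dict]):
        self.replicas = replicas                 # replica name -> key/value store
        self.reachable: Set[str] = set(replicas)
        self.hints: List[Tuple[str, str, object]] = []  # (replica, key, value)

    def write(self, key: str, value: object, w: int) -> int:
        acks = 0
        for name, store in self.replicas.items():
            if name in self.reachable:
                store[key] = value
                acks += 1
            else:
                # Replica is behind the partition: remember the write for later
                self.hints.append((name, key, value))
        if acks < w:
            # Writes already applied are not rolled back in this sketch
            raise RuntimeError(f"only {acks} acks, needed {w}")
        return acks

    def replay_hints(self) -> None:
        remaining = []
        for name, key, value in self.hints:
            if name in self.reachable:
                self.replicas[name][key] = value
            else:
                remaining.append((name, key, value))
        self.hints = remaining
```

A short usage example matching the scenario above:

```python
stores = {"N1": {}, "N2": {}, "N3": {}}
coordinator = Coordinator(stores)
coordinator.reachable.discard("N3")       # partition isolates N3
coordinator.write("x", 42, w=1)           # succeeds with N1 and N2
coordinator.reachable.add("N3")           # partition heals
coordinator.replay_hints()                # N3 catches up
```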

In Cassandra, you choose the consistency level per operation. Drivers set it on each statement; in cqlsh, the CONSISTENCY command sets it for the statements that follow:

-- Strongest common choice: QUORUM (R + W > RF when writes also use QUORUM)
CONSISTENCY QUORUM
SELECT * FROM users WHERE id = 123;

-- Fastest: ONE (eventual)
CONSISTENCY ONE
SELECT * FROM users WHERE id = 123;

-- Tolerance: ANY (write succeeds once the coordinator stores a hint, even if all replicas are down)
CONSISTENCY ANY
INSERT INTO users (id, name) VALUES (123, 'Alice');

Consistency Level	R / W (assuming RF=3)	Behavior During Partition
`ANY`	W=any	Write accepted by coordinator — may be lost on coordinator crash
`ONE`	W=1	Write to any single replica — fastest, most available
`LOCAL_QUORUM`	R=2, W=2	Quorum within a single datacenter — ignores cross-DC
`EACH_QUORUM`	R=2, W=2 (each DC)	Strong but requires all DCs — unavailable during cross-DC partition
`ALL`	R=3, W=3	Write to all replicas — zero tolerance for failure
`SERIAL`	R=quorum + paxos	Linearizable consistency via Paxos — slowest
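
The availability column follows directly from how many replicas each level needs. The sketch below covers the single-datacenter levels only and treats `rf` as the replication factor of the datacenter in question; `required_replicas` and `is_available` are hypothetical helpers, not driver functions.

```python
def required_replicas(level: str, rf: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    quorum = rf // 2 + 1
    needed = {
        "ANY": 0,            # writes only: a stored hint is enough
        "ONE": 1,
        "QUORUM": quorum,
        "LOCAL_QUORUM": quorum,
        "ALL": rf,
    }
    return needed[level]


def is_available(level: str, rf: int, reachable: int) -> bool:
    """Can a request at this level succeed with `reachable` live replicas?"""
    return reachable >= required_replicas(level, rf)


# RF=3 with one replica cut off by a partition
quorum_ok = is_available("QUORUM", rf=3, reachable=2)
all_ok = is_available("ALL", rf=3, reachable=2)
```

With one replica missing, QUORUM keeps working while ALL fails, which is why ALL is rarely used in practice.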

🧬 CRDTs: Reconciling Conflicts Automatically

Conflict-free Replicated Data Types (CRDTs) are one mechanism that lets AP systems converge without human intervention. Their merge rules rely on mathematical properties that guarantee every replica ends up in the same state, whatever order the updates arrive in.

State-based CRDT (CvRDT)

Each node maintains a state that can be merged with any other node’s state using a commutative, associative, idempotent merge function:

# A Grow-Only Counter (G-Counter)
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes
    
    def increment(self):
        self.counts[self.node_id] += 1
    
    def value(self):
        return sum(self.counts)
    
    def merge(self, other):
        # Element-wise max — commutative, associative, idempotent
        for i in range(len(self.counts)):
            self.counts[i] = max(self.counts[i], other.counts[i])

# Node A: increment → [1, 0, 0]
# Node B: increment → [0, 1, 0]
# After partition + merge: [1, 1, 0] → value = 2 (correct!)
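
The three properties named above are not decoration; each one can be checked directly. The sketch below uses a standalone `merge` function equivalent to the element-wise max in the counter, plus a deliberately wrong `merge_sum` (a hypothetical counter-example) to show what breaks when the merge is not idempotent.

```python
def merge(a, b):
    """Element-wise max, as used by the G-Counter."""
    return [max(x, y) for x, y in zip(a, b)]


def merge_sum(a, b):
    """Tempting but wrong: element-wise addition double-counts on re-delivery."""
    return [x + y for x, y in zip(a, b)]


a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 3]

commutative = merge(a, b) == merge(b, a)
associative = merge(merge(a, b), c) == merge(a, merge(b, c))
idempotent = merge(a, a) == a

# Receiving the same state twice inflates the count with the sum-based merge
sum_is_idempotent = merge_sum(a, a) == a
```

Idempotence is what lets replicas gossip their state as often as they like: a duplicated or retried message can never change the result.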

Operation-based CRDT (CmRDT)

Instead of merging states, nodes broadcast operations. If all concurrent operations commute, and every operation eventually reaches every replica, the order of arrival doesn’t matter:

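# A Grow-Only Set (G-Set), operation-based: replicas exchange "add" operations
class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, e):
        # Apply locally, then return the operation to broadcast to other replicas
        self.elements.add(e)
        return ("add", e)

    def apply(self, op):
        # Apply an operation received from another replica.
        # Adds commute and are idempotent, so order and duplicates don't matter.
        kind, e = op
        if kind == "add":
            self.elements.add(e)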
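
The claim that order doesn't matter can be checked by brute force for a small case. The sketch below assumes operations are encoded as `("add", element)` tuples, which is an illustrative choice rather than any library's wire format, and replays the same operations in every possible delivery order.

```python
from itertools import permutations


def apply_ops(ops):
    """Replay a sequence of add operations on an empty grow-only set."""
    state = set()
    for kind, element in ops:
        if kind == "add":
            state.add(element)
    return state


# Three operations, one of them a duplicate delivery of ("add", "a")
ops = [("add", "a"), ("add", "b"), ("add", "a")]

# Every delivery order must converge to the same final state
final_states = {frozenset(apply_ops(order)) for order in permutations(ops)}
converged = len(final_states) == 1
```

A set that also supports removal would fail this test with plain add/remove operations, which is why removable sets need extra metadata such as tombstones or unique tags.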

Real CRDT Implementations

System	CRDT Type	Real Usage
Riak	State-based (counters, sets, maps)	Riak 2.0 data types; plain “last write wins” behaves like a simple LWW register
Redis Enterprise	CRDT sets, counters, maps	Active-Active Redis geo-distributed
Automerge (JavaScript)	Multi-Value Registers + Sequences	Collaborative editing (like Google Docs)
delta-CRDTs	State-based, but sends diffs	Riak 2.0, NDN (Naspers)
SoundCloud	Roshi (LWW-element-set on Redis)	Time-ordered event streams such as activity timelines

🎯 Practical Trade-off Decision Table

When would you choose CP vs AP? Here’s a decision table:

Use Case	CAP Choice	Why	Example Systems
Banking / Ledger	CP	Cannot lose or duplicate transactions. Refuse writes during partition rather than risk inconsistency.	PostgreSQL sync replication, Spanner
DNS	AP	Better to serve a slightly stale IP than return error. The internet itself works this way.	All DNS servers (AP by necessity)
Shopping cart	AP	Losing a cart item is worse than briefly seeing a stale cart. CRDTs reconcile smoothly.	DynamoDB, Cassandra
User sessions	AP	Stale session data (e.g., showing user as logged out for 1 second) is acceptable. Downtime is not.	Redis Cluster, ElastiCache
Stock inventory	CP	Overselling stock due to inconsistent counts costs real money and trust.	MySQL sync replication, PostgreSQL
Social feed	AP	Seeing an old post for a few seconds is fine. The site being down is a headline.	Cassandra (used at Instagram), DynamoDB
CI/CD pipeline state	CP	Recording an incorrect “build passed” status erodes trust. Wait for quorum.	PostgreSQL, etcd, Consul
Distributed locks / coordination	CP	Linearizability is non-negotiable for locks. An unavailable lock is better than a broken lock.	etcd (Raft), Zookeeper (Zab), Consul
Content delivery (CDN)	AP	Serve stale cache during a partition. A slightly old page is a far better experience than a 500 error.	CloudFront, Fastly, Cloudflare

🧠 Key Takeaways

“Pick two” is misleading. You must pick P — partitions are inevitable. The real choice is C vs A during a partition.
Tunable consistency (DynamoDB, Cassandra) is the pragmatic middle ground. Choose consistency per operation, not per system.
PostgreSQL sync replication is CP: it stops acknowledging writes during a partition. This is correct for financial data, terrible for social media.
Cassandra and DynamoDB lean AP by default: they keep serving requests during a partition and converge later. This works for most web workloads.
CRDTs make AP systems practical — they provide automatic conflict resolution without human intervention or complex rollback logic.
True CA systems don’t exist in distributed settings. If your system is “CA,” you haven’t experienced a partition yet.
The real world demands both. Many systems offer tunable consistency so you can be CP for critical operations and AP for everything else.

The CAP theorem doesn’t tell you what to build. It tells you what you’re giving up — so you can make that choice deliberately rather than discovering it during your next outage.
