ποΈ How PostgreSQL MVCC Works β Multi-Version Concurrency Control Deep Dive
How PostgreSQL handles concurrent reads and writes without locking through Multi-Version Concurrency Control β xmin/xmax system columns, transaction snapshots, isolation levels, vacuum internals, and tuple visibility rules.
The Concurrency Problem
Every database with concurrent access faces a fundamental tension: reads and writes must coexist. A naive approach β lock the entire table for every write β serializes everything and kills throughput. The classic alternative, Two-Phase Locking (2PL), uses read locks and write locks with strict acquisition rules, but it comes with a cost: writers block readers, readers block writers, and deadlock is always lurking.
PostgreSQL sidesteps this entirely with MVCC: instead of locking rows when a write happens, it creates new versions of those rows. Readers see a consistent snapshot of the data as it existed when their transaction began. Writers create new row versions that only become visible once the writing transaction commits.
-- Transaction A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Transaction B (concurrent)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- B sees the OLD balance, not affected by A's uncommitted update
The result is a database where readers never block writers, and writers never block readers. This is the single most important performance property of PostgreSQL, and itβs why PostgreSQL can drive thousands of concurrent connections on modest hardware.
xmin and xmax: The System Columns
Every row in PostgreSQL (technically every tuple in every heap page) carries two hidden system columns:
| Column | Type | Purpose |
|---|---|---|
xmin | xid | The transaction ID that created this row version |
xmax | xid | The transaction ID that deleted/expired this row version (0 if active) |
Transaction IDs (xid) are 32-bit unsigned integers. They start at 3 (0 and 1 are reserved, 2 is the βfrozenβ XID for very old tuples) and wrap around after ~4 billion transactions β more on that when we discuss vacuum.
When you INSERT a row, PostgreSQL sets xmin to the current transaction ID. When you UPDATE a row, PostgreSQL marks the existing tuple as deleted by setting its xmax to the current transaction ID, then inserts a new version of the tuple with xmin set to the current transaction ID. A DELETE simply sets xmax on the existing tuple.
INSERT: xmin = TxID, xmax = 0 β visible if TxID committed
UPDATE: old tuple: xmax = TxID β invisible after TxID commits
new tuple: xmin = TxID, xmax = 0 β visible after TxID commits
DELETE: old tuple: xmax = TxID β invisible after TxID commits
Transaction Snapshots
When a transaction begins, PostgreSQL takes a snapshot of which transactions are currently in-flight. The snapshot structure contains:
snapshot {
xmin: lowest transaction ID still active in the system
xmax: next transaction ID to be assigned (all >= xmax are invisible)
xip_list: array of active transaction IDs between xmin and xmax
}
A tuple is then visible to the snapshot if:
- Its
xminis committed and is older thanxminof the snapshot (definitely finished), or - Its
xminis in thexip_listbut belongs to our own transaction (we can see our own changes), and - Its
xmaxis either 0 (not deleted) or the deleting transaction is not yet committed or is invisible to our snapshot.
This visibility check runs for every tuple access. PostgreSQL caches the snapshot in the backend process memory to avoid recomputing it on every row.
Isolation Levels and Snapshot Acquisition
PostgreSQL implements three of the four SQL standard isolation levels (Read Uncommitted behaves identically to Read Committed).
Read Committed (Default)
A new snapshot is taken for each statement within the transaction. This means two SELECT statements in the same transaction can see different data if another transaction commits between them.
# Transaction A
BEGIN; # TxID = 100
SELECT balance FROM accounts WHERE id = 1; # Snapshot at stmt start β 1000
# Transaction B (concurrent)
UPDATE accounts SET balance = 500 WHERE id = 1; # TxID = 101
COMMIT; # 101 commits
# Back in Transaction A
SELECT balance FROM accounts WHERE id = 1; # NEW snapshot β 500
COMMIT;
This is the most common isolation level in production PostgreSQL. It gives good performance with minimal overhead, and the statement-level snapshot behavior is intuitive for most applications.
Repeatable Read
A single snapshot is taken at the beginning of the first statement of the transaction. All subsequent statements see the same snapshot, regardless of what other transactions commit.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1; -- β 1000
-- Another transaction commits a change to id=1
SELECT balance FROM accounts WHERE id = 1; -- β 1000 (same snapshot!)
COMMIT;
If you attempt to UPDATE a row that another transaction has already modified and committed, PostgreSQL raises a serialization_failure error rather than silently overwriting the change. This is the first-committer-wins rule.
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update
Serializable
Uses the same snapshot-once approach as Repeatable Read, but adds Serializable Snapshot Isolation (SSI). SSI tracks read-write conflicts between concurrent transactions using predicate locks and aborts transactions when it detects a pattern that could produce a serialization anomaly.
The overhead is significant β predicate locks consume memory proportional to the number of rows accessed β but itβs the only isolation level that guarantees true serializability without heavy-weight locking.
Isolation Level | Snapshot | Phantom? | Serialization Anomaly?
----------------------|-----------|------------|-----------------------
Read Uncommitted | per-stmt | possible | possible
Read Committed | per-stmt | possible | possible
Repeatable Read | per-txn | not possible | possible (write skew)
Serializable | per-txn | not possible | not possible
The Vacuum Process
MVCC creates dead tuples β row versions that are no longer visible to any active transaction. If left unchecked, these dead tuples bloat the table, wasting disk space and slowing down sequential scans. This is where VACUUM enters.
Heap Page Before Vacuum:
ββββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ
β Tuple 1 β Tuple 2 β Tuple 3 β Tuple 4 β
β (alive) β (dead) β (alive) β (dead) β
ββββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ
Heap Page After Vacuum:
ββββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββ
β Tuple 1 β Tuple 3 β β FREE β
β (alive) β (alive) β β SPACE β
ββββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββ
PostgreSQLβs autovacuum daemon runs automatically. It wakes up every autovacuum_naptime (default 1 minute) and checks each table. A table needs vacuuming when:
- The number of dead tuples exceeds
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
Vacuum does three things:
- Scans heap pages and builds a list of dead tuple IDs
- Removes index entries pointing to dead tuples (a costly step β indexes must be scanned)
- Marks dead tuple space as reusable in the pageβs free space map
What VACUUM Doesnβt Do
It does not compact pages or return disk space to the OS. For that you need VACUUM FULL, which rewrites the entire table β but it takes an ACCESS EXCLUSIVE lock, blocking all reads and writes.
-- Standard vacuum (non-blocking, but doesn't shrink table)
VACUUM accounts;
-- Full vacuum (blocking, shrinks table)
VACUUM FULL accounts;
-- Analyze after vacuum to update planner statistics
VACUUM ANALYZE accounts;
HOT Updates
When a tuple is updated, PostgreSQL ideally wants the new version to be on the same heap page as the old version β this is called a Heap-Only Tuple (HOT) update. The condition is: none of the indexed columns changed.
-- HOT update (no indexed column changed)
UPDATE users SET bio = 'new bio' WHERE id = 1;
-- Non-HOT update (indexed column 'email' changed)
UPDATE users SET email = 'new@email.com' WHERE id = 1;
HOT updates avoid creating new index entries because the new tuple is reachable through the old tupleβs heap page. This dramatically reduces vacuum overhead for tables that are frequently updated but only on non-indexed columns.
Readers Never Block Writers
This is the headline feature of MVCC, and itβs worth understanding exactly how it works at the implementation level.
When a reader needs to access a tuple, they check visibility using their snapshot. The reader never needs to acquire a lock on the tuple β the visibility check is purely a comparison of transaction IDs against the snapshot. Even if a writer is in the middle of an UPDATE that has xmax-marked the tuple, the readerβs snapshot either shows the old transaction as not-yet-committed (in which case the reader sees the old version) or skips past the xmax-marked tuple to find the newer version.
# Pseudocode: PostgreSQL tuple visibility check
def tuple_visible(tuple, snapshot, self_txid):
# Created in future? Not visible.
if tuple.xmin >= snapshot.xmax:
return False
# Created by our transaction? Visible.
if tuple.xmin == self_txid:
return True
# Created by an in-flight transaction? Not visible.
if snapshot.xmin <= tuple.xmin < snapshot.xmax:
if tuple.xmin in snapshot.xip_list:
return False
# Check deletion
if tuple.xmax == 0:
return True # Not deleted
if tuple.xmax == self_txid:
return False # We deleted it (DELETE or UPDATE)
# Deleted by a committed transaction? Not visible.
if tuple.xmax < snapshot.xmin:
return False # Assumes xmax committed
# Deleted by an in-flight transaction? Still visible.
return True
Comparison with 2PL
| Property | MVCC (PostgreSQL) | Two-Phase Locking |
|---|---|---|
| Readers block writers | Never | Yes (depending on lock mode) |
| Writers block readers | Never | Yes (write locks block read locks) |
| Deadlock possibility | No (reads donβt lock) | Yes |
| Storage overhead | Higher (dead tuples, vacuum) | Lower (in-place updates) |
| Read throughput | Excellent (no blocking) | Degrades under contention |
| Write throughput | Good (sequential scan of dead tuples) | Poor under contention |
The tradeoff is storage vs locking contention. In the modern world of cheap storage and high-concurrency workloads, MVCC wins decisively.
Real-World Performance at Scale
PostgreSQLβs MVCC handles remarkable throughput when properly tuned. Key configuration parameters for write-heavy workloads:
# Increase autovacuum aggressiveness for write-heavy tables
autovacuum_vacuum_scale_factor = 0.01 # Recompute after 1% of rows change
autovacuum_vacuum_threshold = 50 # Minimum dead tuples before vacuum
autovacuum_naptime = 30 # Check every 30 seconds
vacuum_cost_limit = 2000 # More aggressive vacuum I/O
# Monitor bloat
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / (n_live_tup + n_dead_tup) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY dead_pct DESC;
At extreme scale, the main MVCC downsides emerge:
-
Transaction ID wraparound: Every ~4 billion transactions, PostgreSQL must freeze old tuples or risk data corruption. autovacuum normally handles this, but if it falls behind, the database will shut down. Monitor
SELECT datname, age(datfrozenxid) FROM pg_database; -
Index bloat: Dead index entries accumulate with every UPDATE. B-tree indexes on frequently-updated columns can balloon to 10Γ the table size. A
REINDEX CONCURRENTLYis the fix. -
Visibility map thrashing: The visibility map tracks which pages are βall visibleβ (no dead tuples). Under extremely high write loads, the visibility map is constantly being updated, which adds checkpoint strain.
Summary
PostgreSQLβs MVCC is the foundation of its reliability and concurrency model. By keeping multiple versions of each row and giving each transaction a consistent snapshot, PostgreSQL eliminates read-write contention β the most common bottleneck in database-backed applications. The cost is storage overhead and vacuum maintenance, but for the vast majority of workloads, the tradeoff is overwhelmingly positive.
PostgreSQL MVCC Docs The Internals of PostgreSQL β comprehensive guide Hellerstein et al., βArchitecture of a Database Systemβ