🗄️ How PostgreSQL MVCC Works — Multi-Version Concurrency Control Deep Dive

Pranesh Nikhar Jun 23, 2026 06/23/26 9 min read

 󰈤

How PostgreSQL handles concurrent reads and writes without locking through Multi-Version Concurrency Control — xmin/xmax system columns, transaction snapshots, isolation levels, vacuum internals, and tuple visibility rules.

The Concurrency Problem

Every database with concurrent access faces a fundamental tension: reads and writes must coexist. A naive approach — lock the entire table for every write — serializes everything and kills throughput. The classic alternative, Two-Phase Locking (2PL), uses read locks and write locks with strict acquisition rules, but it comes with a cost: writers block readers, readers block writers, and deadlock is always lurking.

PostgreSQL sidesteps this entirely with MVCC: instead of locking rows when a write happens, it creates new versions of those rows. Readers see a consistent snapshot of the data as it existed when their transaction began. Writers create new row versions that only become visible once the writing transaction commits.

-- Transaction A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Transaction B (concurrent)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- B sees the OLD balance, not affected by A's uncommitted update

The result is a database where readers never block writers, and writers never block readers. This is the single most important performance property of PostgreSQL, and it’s why PostgreSQL can drive thousands of concurrent connections on modest hardware.

xmin and xmax: The System Columns

Every row in PostgreSQL (technically every tuple in every heap page) carries two hidden system columns:

Column	Type	Purpose
`xmin`	`xid`	The transaction ID that created this row version
`xmax`	`xid`	The transaction ID that deleted/expired this row version (0 if active)

Transaction IDs (xid) are 32-bit unsigned integers. They start at 3 (0 and 1 are reserved, 2 is the “frozen” XID for very old tuples) and wrap around after ~4 billion transactions — more on that when we discuss vacuum.

When you INSERT a row, PostgreSQL sets xmin to the current transaction ID. When you UPDATE a row, PostgreSQL marks the existing tuple as deleted by setting its xmax to the current transaction ID, then inserts a new version of the tuple with xmin set to the current transaction ID. A DELETE simply sets xmax on the existing tuple.

INSERT:  xmin = TxID, xmax = 0          → visible if TxID committed
UPDATE:  old tuple: xmax = TxID          → invisible after TxID commits
         new tuple: xmin = TxID, xmax = 0 → visible after TxID commits
DELETE:  old tuple: xmax = TxID          → invisible after TxID commits

Transaction Snapshots

When a transaction begins, PostgreSQL takes a snapshot of which transactions are currently in-flight. The snapshot structure contains:

snapshot {
  xmin:    lowest transaction ID still active in the system
  xmax:    next transaction ID to be assigned (all >= xmax are invisible)
  xip_list: array of active transaction IDs between xmin and xmax
}

A tuple is then visible to the snapshot if:

Its xmin is committed and is older than xmin of the snapshot (definitely finished), or
Its xmin is in the xip_list but belongs to our own transaction (we can see our own changes), and
Its xmax is either 0 (not deleted) or the deleting transaction is not yet committed or is invisible to our snapshot.

This visibility check runs for every tuple access. PostgreSQL caches the snapshot in the backend process memory to avoid recomputing it on every row.

Isolation Levels and Snapshot Acquisition

PostgreSQL implements three of the four SQL standard isolation levels (Read Uncommitted behaves identically to Read Committed).

Read Committed (Default)

A new snapshot is taken for each statement within the transaction. This means two SELECT statements in the same transaction can see different data if another transaction commits between them.

# Transaction A
BEGIN;                                          # TxID = 100
SELECT balance FROM accounts WHERE id = 1;       # Snapshot at stmt start → 1000
# Transaction B (concurrent)
UPDATE accounts SET balance = 500 WHERE id = 1;  # TxID = 101
COMMIT;                                          # 101 commits
# Back in Transaction A
SELECT balance FROM accounts WHERE id = 1;       # NEW snapshot → 500
COMMIT;

This is the most common isolation level in production PostgreSQL. It gives good performance with minimal overhead, and the statement-level snapshot behavior is intuitive for most applications.

Repeatable Read

A single snapshot is taken at the beginning of the first statement of the transaction. All subsequent statements see the same snapshot, regardless of what other transactions commit.

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- → 1000
-- Another transaction commits a change to id=1
SELECT balance FROM accounts WHERE id = 1;  -- → 1000 (same snapshot!)
COMMIT;

If you attempt to UPDATE a row that another transaction has already modified and committed, PostgreSQL raises a serialization_failure error rather than silently overwriting the change. This is the first-committer-wins rule.

BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update

Serializable

Uses the same snapshot-once approach as Repeatable Read, but adds Serializable Snapshot Isolation (SSI). SSI tracks read-write conflicts between concurrent transactions using predicate locks and aborts transactions when it detects a pattern that could produce a serialization anomaly.

The overhead is significant — predicate locks consume memory proportional to the number of rows accessed — but it’s the only isolation level that guarantees true serializability without heavy-weight locking.

Isolation Level       | Snapshot  | Phantom?   | Serialization Anomaly?
----------------------|-----------|------------|-----------------------
Read Uncommitted      | per-stmt  | possible   | possible
Read Committed        | per-stmt  | possible   | possible
Repeatable Read       | per-txn   | not possible | possible (write skew)
Serializable          | per-txn   | not possible | not possible

The Vacuum Process

MVCC creates dead tuples — row versions that are no longer visible to any active transaction. If left unchecked, these dead tuples bloat the table, wasting disk space and slowing down sequential scans. This is where VACUUM enters.

Heap Page Before Vacuum:
┌──────────┬──────────┬──────────┬──────────┐
│ Tuple 1  │ Tuple 2  │ Tuple 3  │ Tuple 4  │
│ (alive)  │ (dead)   │ (alive)  │ (dead)   │
└──────────┴──────────┴──────────┴──────────┘

Heap Page After Vacuum:
┌──────────┬──────────┬──────────┬────────┐
│ Tuple 1  │ Tuple 3  │          │ FREE   │
│ (alive)  │ (alive)  │          │ SPACE  │
└──────────┴──────────┴──────────┴────────┘

PostgreSQL’s autovacuum daemon runs automatically. It wakes up every autovacuum_naptime (default 1 minute) and checks each table. A table needs vacuuming when:

The number of dead tuples exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

Vacuum does three things:

Scans heap pages and builds a list of dead tuple IDs
Removes index entries pointing to dead tuples (a costly step — indexes must be scanned)
Marks dead tuple space as reusable in the page’s free space map

What VACUUM Doesn’t Do

It does not compact pages or return disk space to the OS. For that you need VACUUM FULL, which rewrites the entire table — but it takes an ACCESS EXCLUSIVE lock, blocking all reads and writes.

-- Standard vacuum (non-blocking, but doesn't shrink table)
VACUUM accounts;

-- Full vacuum (blocking, shrinks table)
VACUUM FULL accounts;

-- Analyze after vacuum to update planner statistics
VACUUM ANALYZE accounts;

HOT Updates

When a tuple is updated, PostgreSQL ideally wants the new version to be on the same heap page as the old version — this is called a Heap-Only Tuple (HOT) update. The condition is: none of the indexed columns changed.

-- HOT update (no indexed column changed)
UPDATE users SET bio = 'new bio' WHERE id = 1;

-- Non-HOT update (indexed column 'email' changed)
UPDATE users SET email = 'new@email.com' WHERE id = 1;

HOT updates avoid creating new index entries because the new tuple is reachable through the old tuple’s heap page. This dramatically reduces vacuum overhead for tables that are frequently updated but only on non-indexed columns.

Readers Never Block Writers

This is the headline feature of MVCC, and it’s worth understanding exactly how it works at the implementation level.

When a reader needs to access a tuple, they check visibility using their snapshot. The reader never needs to acquire a lock on the tuple — the visibility check is purely a comparison of transaction IDs against the snapshot. Even if a writer is in the middle of an UPDATE that has xmax-marked the tuple, the reader’s snapshot either shows the old transaction as not-yet-committed (in which case the reader sees the old version) or skips past the xmax-marked tuple to find the newer version.

# Pseudocode: PostgreSQL tuple visibility check
def tuple_visible(tuple, snapshot, self_txid):
    # Created in future? Not visible.
    if tuple.xmin >= snapshot.xmax:
        return False

    # Created by our transaction? Visible.
    if tuple.xmin == self_txid:
        return True

    # Created by an in-flight transaction? Not visible.
    if snapshot.xmin <= tuple.xmin < snapshot.xmax:
        if tuple.xmin in snapshot.xip_list:
            return False

    # Check deletion
    if tuple.xmax == 0:
        return True  # Not deleted

    if tuple.xmax == self_txid:
        return False  # We deleted it (DELETE or UPDATE)

    # Deleted by a committed transaction? Not visible.
    if tuple.xmax < snapshot.xmin:
        return False  # Assumes xmax committed

    # Deleted by an in-flight transaction? Still visible.
    return True

Comparison with 2PL

Property	MVCC (PostgreSQL)	Two-Phase Locking
Readers block writers	Never	Yes (depending on lock mode)
Writers block readers	Never	Yes (write locks block read locks)
Deadlock possibility	No (reads don’t lock)	Yes
Storage overhead	Higher (dead tuples, vacuum)	Lower (in-place updates)
Read throughput	Excellent (no blocking)	Degrades under contention
Write throughput	Good (sequential scan of dead tuples)	Poor under contention

The tradeoff is storage vs locking contention. In the modern world of cheap storage and high-concurrency workloads, MVCC wins decisively.

Real-World Performance at Scale

PostgreSQL’s MVCC handles remarkable throughput when properly tuned. Key configuration parameters for write-heavy workloads:

# Increase autovacuum aggressiveness for write-heavy tables
autovacuum_vacuum_scale_factor = 0.01       # Recompute after 1% of rows change
autovacuum_vacuum_threshold = 50             # Minimum dead tuples before vacuum
autovacuum_naptime = 30                      # Check every 30 seconds
vacuum_cost_limit = 2000                     # More aggressive vacuum I/O

# Monitor bloat
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / (n_live_tup + n_dead_tup) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY dead_pct DESC;

At extreme scale, the main MVCC downsides emerge:

Transaction ID wraparound: Every ~4 billion transactions, PostgreSQL must freeze old tuples or risk data corruption. autovacuum normally handles this, but if it falls behind, the database will shut down. Monitor SELECT datname, age(datfrozenxid) FROM pg_database;
Index bloat: Dead index entries accumulate with every UPDATE. B-tree indexes on frequently-updated columns can balloon to 10× the table size. A REINDEX CONCURRENTLY is the fix.
Visibility map thrashing: The visibility map tracks which pages are “all visible” (no dead tuples). Under extremely high write loads, the visibility map is constantly being updated, which adds checkpoint strain.

Summary

PostgreSQL’s MVCC is the foundation of its reliability and concurrency model. By keeping multiple versions of each row and giving each transaction a consistent snapshot, PostgreSQL eliminates read-write contention — the most common bottleneck in database-backed applications. The cost is storage overhead and vacuum maintenance, but for the vast majority of workloads, the tradeoff is overwhelmingly positive.

PostgreSQL MVCC Docs The Internals of PostgreSQL — comprehensive guide Hellerstein et al., “Architecture of a Database System”

← Previous
🎯 CAP Theorem with a Real Outage

Next →
⚓ Raft Consensus Algorithm

← [b]ack

posts/ 🧩 LeetCode vs Real Engineering Skills [n]ext → posts/ 🌐 CAP Theorem with a Real Outage Story