🌳 LSM Trees & Bloom Filters — Production Deep Dive

Pranesh Nikhar · Mar 23, 2026 · 9 min read

 󰈤

Why LSM trees exist, how they work (MemTable → WAL → SSTable → compaction), the read path with bloom filters, tiered vs leveled compaction, write amplification, and the RUM conjecture.

🧱 The Problem LSM Trees Solve

Databases are built around one fundamental constraint: random writes are slow on physical media.

Media	Random Write	Sequential Write	Ratio
HDD	~0.5 MB/s (seek time ~10ms)	~200 MB/s	1:400
SSD	~50 MB/s (write amplification, erase blocks)	~2000 MB/s	1:40

If you insert 1,000 rows into a B-tree one at a time, each insert can trigger a random write to disk. On an HDD with ~10 ms seeks, that’s roughly 1,000 × 10 ms = 10 seconds of seek time for 1,000 random inserts.

LSM (Log-Structured Merge) trees solve this by buffering writes in memory and flushing them as large sequential batches to disk. The structure was introduced in the 1996 LSM-tree paper by O’Neil, Cheng, Gawlick, and O’Neil, building on the earlier log-structured file system work of Rosenblum and Ousterhout. It was popularized by Google’s Bigtable (2006), then LevelDB, RocksDB, Cassandra, ScyllaDB, and many others.

🏗️ LSM Tree Architecture

An LSM tree has three main components:

                     write(key, value)
                           │ 1. append to log first
                           ▼
┌──────────────────────────────────────────────┐
│  WAL (Write-Ahead Log) on disk               │
│  ┌────┬────┬────┬────┬────┬────┬────┬────┐  │
│  │ op │ op │ op │ op │ op │ op │ op │ op │  │  ← sequential append
│  └────┴────┴────┴────┴────┴────┴────┴────┘  │
└──────────────────────────────────────────────┘
                           │ 2. then insert
                           ▼
                    ┌─────────────┐
                    │  MemTable    │  ← in-memory sorted structure (skip list / red-black tree)
                    │  (read+write)│
                    └──────┬──────┘
                           │ freeze (when full)
                           ▼
                    ┌─────────────┐
                    │  Immutable   │  ← frozen, no longer accepting writes
                    │  MemTable    │
                    └──────┬──────┘
                           │ flush to disk
                           ▼
              ┌─────────────────────────┐
              │  SSTable File (Level 0) │  ← sorted, immutable data file
              │  [Footer]               │
              │  [Index Block]          │
              │  [Bloom Filter Block]   │
              │  [Data Blocks...]       │
              └─────────────────────────┘
                           │
                    compaction merges
                           ▼
              ┌─────────────────────────┐
              │  SSTable (Level 1)      │  ← larger, merged
              └─────────────────────────┘

MemTable

The MemTable is an in-memory data structure (typically a skip list in LevelDB/RocksDB or a red-black tree). All writes go here first:

// Pseudocode: the write path into a skip-list MemTable
void put(const Slice& key, const Slice& value) {
    WAL_append(key, value);              // Durability: write to log first
    memtable->insert(key, value);        // Then update in-memory table
}

The MemTable supports O(log n) reads and writes. When it reaches a configurable size (typically 4-64 MB), it’s made immutable and a new MemTable takes its place.
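The freeze-and-replace step can be sketched as follows. This is a minimal illustration under stated assumptions, not how LevelDB or RocksDB implement it: `MemTable`, `WriteBuffer`, and `frozen_count` are hypothetical names, `std::map` stands in for the skip list, the WAL append is left out, and the code assumes C++17.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical MemTable: std::map (typically a red-black tree) stands in
// for the skip list used by LevelDB/RocksDB.
class MemTable {
public:
    void put(const std::string& key, const std::string& value) {
        table_[key] = value;
        bytes_ += key.size() + value.size();  // rough estimate; overwrites count again
    }

    std::optional<std::string> get(const std::string& key) const {
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    std::size_t approximate_bytes() const { return bytes_; }

private:
    std::map<std::string, std::string> table_;
    std::size_t bytes_ = 0;
};

// Owns the active MemTable plus frozen ones that are waiting to be flushed.
class WriteBuffer {
public:
    explicit WriteBuffer(std::size_t limit_bytes)
        : limit_(limit_bytes), active_(std::make_unique<MemTable>()) {}

    void put(const std::string& key, const std::string& value) {
        active_->put(key, value);
        if (active_->approximate_bytes() >= limit_) {
            frozen_.push_back(std::move(active_));   // now read-only
            active_ = std::make_unique<MemTable>();  // new writes go here
        }
    }

    std::optional<std::string> get(const std::string& key) const {
        if (auto v = active_->get(key)) return v;
        for (auto it = frozen_.rbegin(); it != frozen_.rend(); ++it) {  // newest first
            if (auto v = (*it)->get(key)) return v;
        }
        return std::nullopt;
    }

    std::size_t frozen_count() const { return frozen_.size(); }

private:
    std::size_t limit_;
    std::unique_ptr<MemTable> active_;
    std::vector<std::unique_ptr<MemTable>> frozen_;
};
```

Reads check the active table first and then the frozen ones from newest to oldest, so a newer value always shadows an older one. A real engine would flush frozen tables in the background and stall writers if too many pile up.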

WAL (Write-Ahead Log)

Before updating the MemTable, every write is appended to the WAL — a sequential file on disk. This provides durability: if the process crashes, the WAL is replayed on restart to reconstruct the MemTable.

WAL is written sequentially → fast on both HDD and SSD
On crash recovery: replay WAL from last checkpoint
WAL can be fsync’d per write (durability) or batched (performance)
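To make append and replay concrete, here is a small sketch over generic streams. The record layout (two 32-bit lengths followed by the key and value bytes) and the names `wal_append` and `wal_replay` are assumptions made for illustration; real engines add checksums, sequence numbers, and an explicit fsync.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical record layout: [u32 key length][u32 value length][key][value].
// Lengths are written in host byte order, so this log is not portable.
inline void wal_append(std::ostream& log, const std::string& key,
                       const std::string& value) {
    const std::uint32_t klen = static_cast<std::uint32_t>(key.size());
    const std::uint32_t vlen = static_cast<std::uint32_t>(value.size());
    log.write(reinterpret_cast<const char*>(&klen), sizeof klen);
    log.write(reinterpret_cast<const char*>(&vlen), sizeof vlen);
    log.write(key.data(), klen);
    log.write(value.data(), vlen);
    log.flush();  // a real engine would also fsync the underlying file
}

// Rebuilds MemTable contents by replaying records in order.
// Stops at the first truncated record, e.g. a torn write at crash time.
inline std::map<std::string, std::string> wal_replay(std::istream& log) {
    std::map<std::string, std::string> memtable;
    while (true) {
        std::uint32_t klen = 0;
        std::uint32_t vlen = 0;
        if (!log.read(reinterpret_cast<char*>(&klen), sizeof klen)) break;
        if (!log.read(reinterpret_cast<char*>(&vlen), sizeof vlen)) break;
        std::string key(klen, '\0');
        std::string value(vlen, '\0');
        if (!log.read(&key[0], klen)) break;
        if (!log.read(&value[0], vlen)) break;
        memtable[key] = value;  // later records overwrite earlier ones
    }
    return memtable;
}
```

Because replay applies records in log order, the rebuilt table ends up with the last value written for each key, which is exactly what the MemTable held before the crash.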

SSTable (Sorted String Table)

When an immutable MemTable is flushed to disk, it becomes an SSTable — an immutable, sorted file on disk. The format:

┌─────────────────────────────┐
│ Data Block 0  (keys a-f)    │
│ Data Block 1  (keys g-l)    │
│ Data Block 2  (keys m-r)    │
│ Data Block 3  (keys s-z)    │
├─────────────────────────────┤
│ Bloom Filter (all keys)     │  ← lets reads skip this file ("definitely absent")
├─────────────────────────────┤
│ Index                       │  ← for each block: last key + offset
│  ┌───────────────────────┐  │
│  │ Block 0: offset=0     │  │
│  │ Block 1: offset=4096  │  │
│  │ Block 2: offset=8192  │  │
│  │ Block 3: offset=12288 │  │
│  └───────────────────────┘  │
├─────────────────────────────┤
│ Footer                      │  ← pointer to index + bloom filter
└─────────────────────────────┘

The index block at the end is loaded into memory on SSTable open. It maps the last key of each data block to the block’s offset, so the read path can binary-search the index to find which data block to load.
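The index lookup described above can be sketched with `std::lower_bound`. `IndexEntry` and `find_block` are hypothetical names, the code assumes C++17, and the entries used in the usage note mirror the example layout (blocks ending at f, l, r, z).

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct IndexEntry {
    std::string last_key;  // largest key stored in the block
    std::uint64_t offset;  // byte offset of the block within the file
};

// Returns the offset of the only block that could hold `key`, or nullopt
// if `key` is greater than every key in the SSTable.
// `index` must be sorted by last_key, which holds because blocks are
// written in key order.
inline std::optional<std::uint64_t> find_block(
    const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.last_key < k; });
    if (it == index.end()) return std::nullopt;
    return it->offset;
}
```

With the index `{f→0, l→4096, r→8192, z→12288}`, a lookup for `"g"` lands on the first entry whose last key is not smaller than `"g"`, which is block 1 at offset 4096. Only that one block needs to be read from disk.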

🔍 The Read Path (Step by Step)

Reading a key from an LSM tree is more expensive than writing to one because data exists in multiple levels:

Read "key_xyz":

1. Check MemTable (in memory)
   ├── Found? Return immediately.
   └── Not found? Continue.

2. Check immutable MemTable (in memory)
   ├── Found? Return immediately.
   └── Not found? Continue.

3. For each SSTable, newest to oldest:
   a. Query the SSTable's bloom filter
      ├── "key_xyz definitely not here" → skip this SSTable
      └── "key_xyz might be here" → continue
   
   b. Binary-search the in-memory index to find the relevant data block
   c. Read and decompress the data block from disk
   d. Binary-search within the data block
      ├── Found? Return value.
      └── Not found? Continue to next SSTable.

4. Key does not exist.

This is why LSM-tree reads can be expensive: in the worst case, you check every level. Bloom filters are critical for skipping SSTables that can’t contain the key.
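The steps above can be sketched in a few lines. Everything here is a stand-in chosen for illustration: `LsmReader` and `SSTable` are hypothetical types, a `std::function` predicate plays the role of the bloom filter, a `std::map` plays the role of the on-disk index and data blocks, and delete markers are ignored. The `disk_lookups` counter only exists to show how often the filter saves a disk read. The code assumes C++17.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Table = std::map<std::string, std::string>;

struct SSTable {
    std::function<bool(const std::string&)> may_contain;  // bloom filter stand-in
    Table data;  // stand-in for index + data blocks on disk
};

struct LsmReader {
    Table memtable;
    Table immutable_memtable;
    std::vector<SSTable> sstables;  // ordered newest first
    mutable int disk_lookups = 0;   // how many SSTables we had to open

    std::optional<std::string> get(const std::string& key) const {
        // Steps 1 and 2: in-memory tables, newest data first.
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        if (auto it = immutable_memtable.find(key); it != immutable_memtable.end())
            return it->second;

        // Step 3: SSTables from newest to oldest.
        for (const SSTable& sst : sstables) {
            if (!sst.may_contain(key)) continue;  // definitely absent: skip the file
            ++disk_lookups;                       // a data block would be read here
            if (auto it = sst.data.find(key); it != sst.data.end()) return it->second;
        }
        return std::nullopt;  // Step 4: key does not exist
    }
};
```

A false positive shows up here as a `disk_lookups` increment that finds nothing, which is the cost the filter's bits-per-key setting is meant to keep small.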

🌸 Bloom Filters: How They Work

A bloom filter is a probabilistic data structure that answers: “Is element x in set S?”

No false negatives: If the filter says “no”, the element is definitely not in the set.
False positives possible: If the filter says “yes”, the element might be in the set (we still need to check the actual data).

The Math

A bloom filter is a bit array of m bits with k hash functions.

Insert "apple":
  h1("apple") = 2   → set bit 2
  h2("apple") = 7   → set bit 7
  h3("apple") = 12  → set bit 12

Insert "banana":
  h1("banana") = 5  → set bit 5
  h2("banana") = 7  → set bit 7 (already set)
  h3("banana") = 1  → set bit 1

Query "apple":
  h1("apple") = 2   → bit is 1 ✓
  h2("apple") = 7   → bit is 1 ✓
  h3("apple") = 12  → bit is 1 ✓  → "probably present"

Query "grape":
  h1("grape") = 2   → bit is 1
  h2("grape") = 9   → bit is 0 ✗ → "definitely absent"

Optimal Parameter Calculation

Given n elements and desired false-positive rate p:

m = -n · ln(p) / (ln(2))²     // optimal number of bits
k = (m/n) · ln(2)              // optimal number of hash functions

Desired FPR	`m/n` (bits per key)	`k` (hash functions)
10%	4.8	3
1%	9.6	7
0.1%	14.4	10
0.01%	19.2	13

RocksDB bloom filters are commonly configured with 10 bits per key, which gives roughly a 1% false-positive rate (about 0.8% at the theoretical optimum of k = 7).
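The two sizing formulas translate directly into code. The sketch below is illustrative: `BloomParams`, `optimal_bloom_params`, and `expected_fpr` are hypothetical names, k is rounded to the nearest integer, and `expected_fpr` uses the standard approximation (1 − e^(−k·n/m))^k.

```cpp
#include <cmath>
#include <cstddef>

struct BloomParams {
    double bits_per_key;     // m / n
    std::size_t total_bits;  // m, rounded up
    unsigned hash_count;     // k, rounded to nearest, at least 1
};

// Sizes a filter for n keys and a target false-positive rate p (0 < p < 1).
inline BloomParams optimal_bloom_params(std::size_t n, double p) {
    const double ln2 = std::log(2.0);
    const double bits_per_key = -std::log(p) / (ln2 * ln2);
    BloomParams out;
    out.bits_per_key = bits_per_key;
    out.total_bits = static_cast<std::size_t>(
        std::ceil(bits_per_key * static_cast<double>(n)));
    const long k = std::lround(bits_per_key * ln2);
    out.hash_count = static_cast<unsigned>(k < 1 ? 1 : k);
    return out;
}

// Approximate false-positive rate for a given bits-per-key budget and k.
inline double expected_fpr(double bits_per_key, unsigned k) {
    return std::pow(1.0 - std::exp(-static_cast<double>(k) / bits_per_key),
                    static_cast<double>(k));
}
```

For example, `optimal_bloom_params(1000, 0.01)` asks for about 9.6 bits per key and 7 hash functions, and `expected_fpr(10.0, 7)` comes out a little above 0.8%.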

Practical Implementation

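In practice, each SSTable carries its own bloom filter, sized from the number of keys in that file. The filter for a small L0 file might cover 100 keys, while the filter for a large L3 file might cover 10M keys and use far more memory. At a fixed bits-per-key setting, both have the same expected false-positive rate.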

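// Simplified bloom filter class.
// Assumptions: std::string_view stands in for Slice, and double hashing
// (h1 + i*h2) stands in for k independent hash functions.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

class BloomFilter {
public:
    // m = number of bits (must be > 0), k = number of probes per key
    BloomFilter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {}

    void insert(std::string_view key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[probe(key, i)] = true;
    }

    bool possibly_contains(std::string_view key) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[probe(key, i)]) return false;  // definitely absent
        }
        return true;  // may be a false positive
    }

private:
    // Position of the i-th probe for `key`, in [0, m).
    std::size_t probe(std::string_view key, std::size_t i) const {
        const std::uint64_t h1 = std::hash<std::string_view>{}(key);
        const std::uint64_t h2 = ((h1 >> 32) | (h1 << 32)) | 1;  // derived second hash, forced odd
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;  // m bits, all initially 0
    std::size_t k_;           // probes per key
};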

🔄 Compaction: Why LSM Trees Don’t Grow Forever

Without compaction, you’d have thousands of SSTables on disk, and every read would need to check all of them. Compaction merges SSTables together and discards old data.
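At its core, compaction is a k-way merge of sorted runs that keeps only the newest version of each key. The sketch below shows that merge with a min-heap. `Entry`, `Run`, and `compact` are hypothetical names, runs are plain in-memory vectors rather than files, and delete markers are not modeled.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // key, value
using Run = std::vector<Entry>;                     // sorted by key, unique keys

// Merges sorted runs into one sorted run. `runs` is ordered newest first;
// when several runs hold the same key, only the newest value survives.
inline Run compact(const std::vector<Run>& runs) {
    struct Cursor {
        std::size_t run;  // which run this cursor reads from
        std::size_t pos;  // position inside that run
    };
    auto later = [&runs](const Cursor& a, const Cursor& b) {
        const std::string& ka = runs[a.run][a.pos].first;
        const std::string& kb = runs[b.run][b.pos].first;
        if (ka != kb) return ka > kb;  // smaller key pops first
        return a.run > b.run;          // on ties, the newer run pops first
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push(Cursor{r, 0});
    }

    Run out;
    while (!heap.empty()) {
        const Cursor c = heap.top();
        heap.pop();
        const Entry& e = runs[c.run][c.pos];
        // The first entry seen for a key is the newest; later ones are stale.
        if (out.empty() || out.back().first != e.first) out.push_back(e);
        if (c.pos + 1 < runs[c.run].size()) heap.push(Cursor{c.run, c.pos + 1});
    }
    return out;
}
```

Merging `{a→2, c→9}` (newer) with `{a→1, b→5}` (older) yields `{a→2, b→5, c→9}`: the stale `a→1` is dropped, which is how compaction reclaims space and reduces the number of files a read must check.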

Tiered Compaction (Cassandra / HBase)

Level 0: [sst1] [sst2] [sst3] [sst4]  ← up to N files
                  ↓ merge to L1
Level 1: [sst_large]                   ← single merged file
                  ↓ when L1 gets big
Level 2: [sst_even_larger]

Strategy: Each level can hold up to N SSTables. When a level exceeds N, all its files are merged into one file in the next level.

Pro: Lower write amplification (data is rewritten roughly once per level it is promoted through)
Con: Temporary space amplification (multiple copies of the same key exist)
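The tiered trigger rule can be sketched with file sizes alone. This is a bookkeeping model, not a storage engine: `add_flushed_sstable` is a hypothetical function, levels are vectors of sizes in MB, and the merged file is assumed to be the plain sum of its inputs (no duplicate keys dropped).

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// levels[i] holds the sizes (in MB) of the SSTables currently at level i.
// Adds a freshly flushed SSTable to level 0, then cascades: whenever a level
// holds more than max_files files, all of them are merged into a single file
// in the next level.
inline void add_flushed_sstable(std::vector<std::vector<std::uint64_t>>& levels,
                                std::uint64_t size_mb, std::size_t max_files) {
    if (levels.empty()) levels.emplace_back();
    levels[0].push_back(size_mb);

    for (std::size_t i = 0; i < levels.size(); ++i) {
        if (levels[i].size() <= max_files) break;  // level still within its limit
        const std::uint64_t merged = std::accumulate(
            levels[i].begin(), levels[i].end(), std::uint64_t{0});
        levels[i].clear();
        if (i + 1 == levels.size()) levels.emplace_back();
        levels[i + 1].push_back(merged);
    }
}
```

With `max_files = 2`, nine flushes of 10 MB each end with a single 90 MB file at level 2. Each byte was rewritten once per level it passed through, which is where the lower write amplification of tiered compaction comes from.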

Leveled Compaction (LevelDB / RocksDB default)

Level 0: [sst1] [sst2] [sst3]  ← files may overlap in key range
Level 1: ┌──┬──┬──┬──┬──┬──┐  ← non-overlapping, sorted runs
         │a-e│f-j│k-o│p-t│u-x│y-z│
         └──┴──┴──┴──┴──┴──┘
Level 2: ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
         │a-c│d-f│g-i│j-l│m-o│p-r│s-u│v-x│y-z│...
         └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
         (10× larger)

Strategy: Each level is 10× larger than the previous. An L1 file is compacted into a subset of L2 files that overlap in key range.

Pro: Better read performance (fewer files to check per level)
Con: Higher write amplification (same key data gets rewritten multiple times as it flows down levels)
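Choosing which next-level files take part in a leveled compaction is a range-overlap test. In the sketch below, `FileRange` and `overlapping_files` are hypothetical names and key ranges are inclusive on both ends. A linear scan keeps it short, although a sorted, non-overlapping level would also allow binary search.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct FileRange {
    std::string smallest;  // smallest key in the file (inclusive)
    std::string largest;   // largest key in the file (inclusive)
};

// Returns the indices of files in `next_level` whose key range intersects
// `input`. These are the files that must be rewritten together with `input`.
inline std::vector<std::size_t> overlapping_files(
    const FileRange& input, const std::vector<FileRange>& next_level) {
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < next_level.size(); ++i) {
        const FileRange& f = next_level[i];
        const bool disjoint =
            f.largest < input.smallest || input.largest < f.smallest;
        if (!disjoint) picked.push_back(i);
    }
    return picked;
}
```

Using the diagram's ranges, compacting the L1 file `f-j` into an L2 made of `a-c`, `d-f`, `g-i`, `j-l` selects the last three files. One input file forcing three files to be rewritten is the source of the extra write amplification.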

Write Amplification

Write amplification is the ratio of bytes written to disk vs bytes of new data ingested:

Write Amplification = (Total bytes written to disk) / (Ingested data)

Compaction	Typical Write Amp	Typical Space Amp
Leveled (LevelDB)	10-30×	1.1×
Tiered (Cassandra)	3-10×	1.5-3×
Size-Tiered (HBase)	4-8×	2-5×

A write amplification of 20× means: if you write 1 GB of new data, 20 GB is written to disk (including compaction rewrites). This matters for SSD lifespan, since consumer SSDs are typically rated for ~100-300 TBW (terabytes written).
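The endurance arithmetic is easy to check in code. Both helpers below are hypothetical and use decimal units (1 TB = 1000 GB); the 100 GB/day ingest rate in the usage note is an assumed workload, not a figure from any benchmark.

```cpp
// Bytes that actually reach the device for a given amount of ingested data.
inline double device_gb_written(double ingested_gb, double write_amp) {
    return ingested_gb * write_amp;
}

// Days until a drive's rated endurance is used up at a steady ingest rate.
inline double endurance_days(double rated_tbw, double ingest_gb_per_day,
                             double write_amp) {
    const double device_tb_per_day = ingest_gb_per_day * write_amp / 1000.0;
    return rated_tbw / device_tb_per_day;
}
```

At 20× write amplification, ingesting 100 GB per day writes 2 TB per day to the device, so a drive rated for 300 TBW would reach its rating in about 150 days. Dropping to 5× stretches that to about 600 days.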

📐 The RUM Conjecture

The RUM Conjecture (Read Overhead, Update Overhead, Memory/Storage Overhead) states that for a data structure or access method:

You can optimize any two of Read, Update, and Memory but must sacrifice the third.

                    Read
                     ▲
                     │
        B-tree       │     Hash table
       (R low,      │    (R low,
        U high,     │     U low,
        M mid)      │     M high)
                     │
                     ├──────────► Update
                     │
                     │
        LSM Tree     │
       (R high,      │
        U low,       │
        M mid)       │
                     │
                    Memory

B-tree: Fast reads (one traversal to leaf), slow random writes (page splits, rebalancing), medium memory.
Hash table: Fast reads (O(1)), fast writes (O(1) amortized), high memory (keep everything in RAM).
LSM tree: Fast writes (sequential), slow reads (check N levels), medium memory (bloom filters + index blocks).

In practice, you tune LSM parameters to shift along the RUM triangle:

More bloom filter bits → more memory, fewer false positives → faster reads
Smaller SSTable sizes → more files to check → slower reads, faster compaction
Leveled compaction → better reads, worse write amplification

🏭 Production Examples

RocksDB (Facebook/Meta)

RocksDB powers MySQL’s MyRocks storage engine, Kafka Streams’ local state stores, and many more systems:

// RocksDB configuration for write-heavy workload
#include <rocksdb/options.h>

rocksdb::Options options;
options.create_if_missing = true;
options.write_buffer_size = 64 << 20;        // 64 MB MemTable
options.max_write_buffer_number = 4;         // up to 3 immutables
options.target_file_size_base = 64 << 20;    // 64 MB SSTables
options.max_bytes_for_level_base = 512 << 20; // 512 MB for L1
options.soft_pending_compaction_bytes_limit = 64ULL << 30;
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;

Key RocksDB features:

Prefix bloom filters: Skip reading entire SSTable when prefix is known
Partitioned index/filters: Read only a fraction of the index into memory
Block compression: Each SSTable data block is compressed (lz4, zstd, snappy), optionally with a shared dictionary
Rate limiter: Throttle compaction I/O to avoid starving user reads

Cassandra (Apache)

Cassandra uses tiered compaction by default (SizeTieredCompactionStrategy), but a table can opt into leveled compaction:

Table schema:
CREATE TABLE user_timeline (
    user_id UUID,
    timestamp TIMESTAMP,
    content TEXT,
    PRIMARY KEY (user_id, timestamp)
) WITH compaction = {'class': 'LeveledCompactionStrategy'};

Cassandra bloom filters are stored off-heap (in native memory, not Java heap) and are serialized with each SSTable. When a node restarts, bloom filters are loaded into memory (not rebuilt from data), which typically takes seconds even for hundreds of GB of data.

📊 Summary: LSM Tree Trade-offs

Aspect	LSM Tree	B-tree
Random writes	Fast (buffered in MemTable)	Slow (4KB random writes to disk)
Sequential reads	Fast (within SSTable)	Fast (B-tree traversal)
Random reads	Slow (check N levels, even with bloom filters)	Fast (single traversal)
Space amplification	Medium (duplicate keys across levels)	Low (in-place updates)
Write amplification	3-30× depending on compaction	Lower (no compaction, but each update rewrites a full page)
Range scans	Fast (sorted SSTables, merge)	Fast (in-order leaf traversal)
Concurrent writes	Good (MemTable is lock-free skip list)	Moderate (page latch contention)
Crash recovery	Fast (replay WAL)	Slow (redo log replay + recovery)

LSM trees are the dominant design for modern write-heavy workloads: time-series databases (InfluxDB), key-value stores (RocksDB/LevelDB), wide-column stores (Cassandra/Bigtable/HBase), and search indices (Lucene/Solr/Elasticsearch segments are LSM-like).

The next time you choose a database, remember: you’re really choosing where on the RUM triangle you want to sit.
