π³ LSM Trees & Bloom Filters β Production Deep Dive
Why LSM trees exist, how they work (MemTable β WAL β SSTable β compaction), the read path with bloom filters, tiered vs leveled compaction, write amplification, and the RUM conjecture.
π§± The Problem LSM Trees Solve
Databases are built around one fundamental constraint: random writes are slow on physical media.
| Media | Random Write | Sequential Write | Ratio |
|---|---|---|---|
| HDD | ~0.5 MB/s (seek time ~10ms) | ~200 MB/s | 1:400 |
| SSD | ~50 MB/s (write amplification, erase blocks) | ~2000 MB/s | 1:40 |
If you insert 1,000 rows into a B-tree one at a time, each insert triggers a random write to disk. On an HDD, thatβs 10 seconds of seek time for 1,000 random inserts.
LSM (Log-Structured Merge) trees solve this by buffering writes in memory and flushing them as large sequential batches to disk. The idea originated in the 1996 Ousterhout paper on log-structured file systems and was popularized by Googleβs Bigtable (2006), then LevelDB, RocksDB, Cassandra, ScyllaDB, and many others.
ποΈ LSM Tree Architecture
An LSM tree has three main components:
βββββββββββββββ
β MemTable β β in-memory balanced tree (red-black / skip list)
β (read+write)β
ββββββββ¬βββββββ
β flush (when full)
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββ
β WAL (Write-Ahead Log) on disk β
β ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β
β β op β op β op β op β op β op β op β op β β β sequential append
β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββ
β Immutable β β frozen, no longer accepting writes
β MemTable β
ββββββββ¬βββββββ
β flush to disk
βΌ
βββββββββββββββββββββββββββ
β SSTable File (Level 0) β β sorted, immutable data file
β [Footer] β
β [Index Block] β
β [Bloom Filter Block] β
β [Data Blocks...] β
βββββββββββββββββββββββββββ
β
compaction merges
βΌ
βββββββββββββββββββββββββββ
β SSTable (Level 1) β β larger, merged
βββββββββββββββββββββββββββ
MemTable
The MemTable is an in-memory data structure (typically a skip list in LevelDB/RocksDB or a red-black tree). All writes go here first:
// Pseudocode: skipping to a skip list
void put(const Slice& key, const Slice& value) {
WAL_append(key, value); // Durability: write to log first
memtable->insert(key, value); // Then update in-memory table
}
The MemTable supports O(log n) reads and writes. When it reaches a configurable size (typically 4-64 MB), itβs made immutable and a new MemTable takes its place.
WAL (Write-Ahead Log)
Before updating the MemTable, every write is appended to the WAL β a sequential file on disk. This provides durability: if the process crashes, the WAL is replayed on restart to reconstruct the MemTable.
- WAL is written sequentially β fast on both HDD and SSD
- On crash recovery: replay WAL from last checkpoint
- WAL can be
fsyncβd per write (durability) or batched (performance)
SSTable (Sorted String Table)
When an immutable MemTable is flushed to disk, it becomes an SSTable β an immutable, sorted file on disk. The format:
βββββββββββββββββββββββββββββββ
β Data Block 0 (keys a-f) β
β Data Block 1 (keys g-l) β
β Data Block 2 (keys m-r) β
β Data Block 3 (keys s-z) β
βββββββββββββββββββββββββββββββ€
β Bloom Filter (all keys) β β false-positive check
βββββββββββββββββββββββββββββββ€
β Index β β for each block: last key + offset
β βββββββββββββββββββββββββ β
β β Block 0: offset=0 β β
β β Block 1: offset=4096 β β
β β Block 2: offset=8192 β β
β β Block 3: offset=12288 β β
β βββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββ€
β Footer β β pointer to index + bloom filter
βββββββββββββββββββββββββββββββ
The index block at the end is loaded into memory on SSTable open. It maps the last key of each data block to the blockβs offset, so the read path can binary-search the index to find which data block to load.
π The Read Path (Step by Step)
Reading a key from an LSM tree is more expensive than writing to one because data exists in multiple levels:
Read "key_xyz":
1. Check MemTable (in memory)
βββ Found? Return immediately.
βββ Not found? Continue.
2. Check immutable MemTable (in memory)
βββ Found? Return immediately.
βββ Not found? Continue.
3. For each SSTable, newest to oldest:
a. Query the SSTable's bloom filter
βββ "key_xyz definitely not here" β skip this SSTable
βββ "key_xyz might be here" β continue
b. Binary-search the in-memory index to find the relevant data block
c. Read and decompress the data block from disk
d. Binary-search within the data block
βββ Found? Return value.
βββ Not found? Continue to next SSTable.
4. Key does not exist.
This is why LSM-tree reads can be expensive: in the worst case, you check every level. Bloom filters are critical for skipping SSTables that canβt contain the key.
πΈ Bloom Filters: How They Work
A bloom filter is a probabilistic data structure that answers: βIs element x in set S?β
- No false negatives: If the filter says βnoβ, the element is definitely not in the set.
- False positives possible: If the filter says βyesβ, the element might be in the set (we still need to check the actual data).
The Math
A bloom filter is a bit array of m bits with k hash functions.
Insert "apple":
h1("apple") = 2 β set bit 2
h2("apple") = 7 β set bit 7
h3("apple") = 12 β set bit 12
Insert "banana":
h1("banana") = 5 β set bit 5
h2("banana") = 7 β set bit 7 (already set)
h3("banana") = 1 β set bit 1
Query "apple":
h1("apple") = 2 β bit is 1 β
h2("apple") = 7 β bit is 1 β
h3("apple") = 12 β bit is 1 β β "probably present"
Query "grape":
h1("grape") = 2 β bit is 1
h2("grape") = 9 β bit is 0 β β "definitely absent"
Optimal Parameter Calculation
Given n elements and desired false-positive rate p:
m = -n Β· ln(p) / (ln(2))Β² // optimal number of bits
k = (m/n) Β· ln(2) // optimal number of hash functions
| Desired FPR | m/n (bits per key) | k (hash functions) |
|---|---|---|
| 10% | 4.8 | 3 |
| 1% | 9.6 | 7 |
| 0.1% | 14.4 | 10 |
| 0.01% | 19.2 | 14 |
RocksDB default SSTable bloom filter: 10 bits per key (~0.8% false positive rate).
Practical Implementation
A common optimization: each SSTable has its own bloom filter. The filter for L0 might have 100 keys (smaller, less accurate), while the filter for L3 might have 10M keys (larger, more memory).
// Simplified bloom filter class
class BloomFilter {
std::bitset<m> bits;
HashFn hashes[k];
void insert(Slice key) {
for (auto& h : hashes) {
bits.set(h(key) % m);
}
}
bool possibly_contains(Slice key) {
for (auto& h : hashes) {
if (!bits.test(h(key) % m)) return false;
}
return true; // may be false positive
}
};
π Compaction: Why LSM Trees Donβt Grow Forever
Without compaction, youβd have thousands of SSTables on disk, and every read would need to check all of them. Compaction merges SSTables together and discards old data.
Tiered Compaction (Cassandra / HBase)
Level 0: [sst1] [sst2] [sst3] [sst4] β up to N files
β merge to L1
Level 1: [sst_large] β single merged file
β when L1 gets big
Level 2: [sst_even_larger]
Strategy: Each level can hold up to N SSTables. When a level exceeds N, all its files are merged into one file in the next level.
- Pro: Lower write amplification (data is compacted only once when promoted)
- Con: Temporary space amplification (multiple copies of the same key exist)
Leveled Compaction (LevelDB / RocksDB default)
Level 0: [sst1] [sst2] [sst3] β files may overlap in key range
Level 1: ββββ¬βββ¬βββ¬βββ¬βββ¬βββ β non-overlapping, sorted runs
βa-eβf-jβk-oβp-tβu-xβy-zβ
ββββ΄βββ΄βββ΄βββ΄βββ΄βββ
Level 2: ββββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ¬βββ
βa-cβd-fβg-iβj-lβm-oβp-rβs-uβv-xβy-zβ...
ββββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ΄βββ
(10Γ larger)
Strategy: Each level is 10Γ larger than the previous. An L1 file is compacted into a subset of L2 files that overlap in key range.
- Pro: Better read performance (fewer files to check per level)
- Con: Higher write amplification (same key data gets rewritten multiple times as it flows down levels)
Write Amplification
Write amplification is the ratio of bytes written to disk vs bytes of new data ingested:
Write Amplification = (Total bytes written to disk) / (Ingested data)
| Compaction | Typical Write Amp | Typical Space Amp |
|---|---|---|
| Leveled (LevelDB) | 10-30Γ | 1.1Γ |
| Tiered (Cassandra) | 3-10Γ | 1.5-3Γ |
| Size-Tiered (HBase) | 4-8Γ | 2-5Γ |
A write amplification of 20Γ means: if you write 1 GB of new data, 20 GB is written to disk (including compaction rewrites). This matters for SSD lifespan β consumer SSDs are rated for ~100-300 TBW (Total Bytes Written).
π The RUM Conjecture
The RUM Conjecture (Read Overhead, Update Overhead, Memory/Storage Overhead) states that for a data structure or access method:
You can optimize any two of Read, Update, and Memory but must sacrifice the third.
Read
β²
β
B-tree β Hash table
(R low, β (R low,
U high, β U low,
M mid) β M high)
β
ββββββββββββΊ Update
β
β
LSM Tree β
(R high, β
U low, β
M mid) β
β
Memory
- B-tree: Fast reads (one traversal to leaf), slow random writes (cache line split, rebalancing), medium memory.
- Hash table: Fast reads (O(1)), fast writes (O(1) amortized), high memory (keep everything in RAM).
- LSM tree: Fast writes (sequential), slow reads (check N levels), medium memory (bloom filters + index blocks).
In practice, you tune LSM parameters to shift along the RUM triangle:
- More bloom filter bits β more memory, fewer false positives β faster reads
- Smaller SSTable sizes β more files to check β slower reads, faster compaction
- Leveled compaction β better reads, worse write amplification
π Production Examples
RocksDB (Facebook/Meta)
RocksDB powers MySQLβs MyRocks storage engine, Apache Kafkaβs internal state stores, and many more systems:
// RocksDB configuration for write-heavy workload
Options options;
options.create_if_missing = true;
options.write_buffer_size = 64 << 20; // 64 MB MemTable
options.max_write_buffer_number = 4; // up to 3 immutables
options.target_file_size_base = 64 << 20; // 64 MB SSTables
options.max_bytes_for_level_base = 512 << 20; // 512 MB for L1
options.soft_pending_compaction_bytes_limit = 64ULL << 30;
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;
Key RocksDB features:
- Prefix bloom filters: Skip reading entire SSTable when prefix is known
- Partitioned index/filters: Read only a fraction of the index into memory
- Dictionary compression: Each SSTable data block is compressed (lz4, zstd, snappy)
- Rate limiter: Throttle compaction I/O to avoid starving user reads
Cassandra (Apache)
Cassandra uses tiered compaction by default (SizeTieredCompactionStrategy):
Table schema:
CREATE TABLE user_timeline (
user_id UUID,
timestamp TIMESTAMP,
content TEXT,
PRIMARY KEY (user_id, timestamp)
) WITH compaction = {'class': 'LeveledCompactionStrategy'};
Cassandra bloom filters are stored off-heap (in native memory, not Java heap) and are serialized with each SSTable. When a node restarts, bloom filters are loaded into memory (not rebuilt from data) β this takes seconds for hundreds of GB of data.
π Summary: LSM Tree Trade-offs
| Aspect | LSM Tree | B-tree |
|---|---|---|
| Random writes | Fast (buffered in MemTable) | Slow (4KB random writes to disk) |
| Sequential reads | Fast (within SSTable) | Fast (B-tree traversal) |
| Random reads | Slow (check N levels, even with bloom filters) | Fast (single traversal) |
| Space amplification | Medium (duplicate keys across levels) | Low (in-place updates) |
| Write amplification | 5-30Γ depending on compaction | Low (~1Γ with in-place update) |
| Range scans | Fast (sorted SSTables, merge) | Fast (in-order leaf traversal) |
| Concurrent writes | Good (MemTable is lock-free skip list) | Moderate (page latch contention) |
| Crash recovery | Fast (replay WAL) | Slow (redo log replay + recovery) |
LSM trees are the dominant design for modern write-heavy workloads: time-series databases (InfluxDB/TimescaleDB), key-value stores (RocksDB/LevelDB), wide-column stores (Cassandra/Bigtable/HBase), and search indices (Lucene/Solr/Elasticsearch segments are LSM-like).
The next time you choose a database, remember: youβre really choosing where on the RUM triangle you want to sit.