Reqflow
← All articles
Deep DiveMarch 7, 2026·11 min read

LSM Trees vs B-Trees: Why Your Database's Storage Engine Is a Design Decision

Every database makes a fundamental choice between write-optimized and read-optimized storage. Here's what that means for your workload.

databasesstorage engineslsm-treeb-treeinternals

Two storage engine designs dominate modern databases: B-trees (PostgreSQL, MySQL InnoDB, SQLite) and Log-Structured Merge trees (Cassandra, RocksDB, LevelDB, ScyllaDB). They make opposite trade-offs between write performance and read performance. Most engineers use these databases every day without knowing which is underneath. Then they get surprised by the performance characteristics in production.

B-trees: random writes to a sorted structure

A B-tree is a balanced tree of fixed-size pages (typically 4KB or 8KB), kept always sorted by key. Reads are fast: to find a key, you traverse from root to leaf, typically 3-4 page reads for billion-row tables. Writes are also reasonably fast for small updates: find the page, overwrite the value in-place. The problem is random writes at scale: writing a single byte requires reading an entire 4KB page from disk, modifying it in memory, and writing the whole page back. For write-heavy workloads, this causes high write amplification (ratio of bytes written to disk vs. bytes written by the application). B-tree pages also suffer from fragmentation over time, requiring VACUUM or similar maintenance operations.

LSM trees: sequential writes everywhere

LSM trees never modify data in place. Every write is an append to an in-memory buffer (the memtable). When the memtable fills up, it's flushed to disk as a sorted, immutable file called an SSTable. SSTables are never modified. They're always written sequentially. This makes writes extremely fast: every write is a sequential disk append, which is 10-100x faster than a random write on a spinning disk and significantly faster even on SSDs. The trade-off is reads: to find a key, you might have to search multiple SSTables plus the memtable. Without optimizations, a read is O(number of SSTables), which grows without bound.

Compaction: the background cost of LSM writes

LSM trees manage the growing set of SSTables through compaction: a background process that reads multiple SSTables, merges them, drops deleted/overwritten keys, and writes out a smaller set of new SSTables. Compaction keeps the number of SSTables bounded so reads stay fast, but it costs I/O: for every byte written by the application, the storage engine may write it to disk 5-20x across the lifetime of that data (write amplification). Poorly tuned compaction is why Cassandra and RocksDB deployments sometimes show disk I/O much higher than application writes. That's also why Discord's Cassandra migration story mentioned compaction as a pain point.

Bloom filters: making LSM reads fast

The standard optimization for LSM reads is a Bloom filter per SSTable: a probabilistic data structure that answers 'definitely not in this SSTable' or 'probably in this SSTable' in nanoseconds. Before searching an SSTable, you check its Bloom filter. If it says 'no', you skip the entire SSTable. This reduces the number of SSTables you actually search from O(all SSTables) to O(SSTables that might contain the key), which in practice is usually 1-2 for point lookups. This is why Bloom filters and LSM trees are always mentioned together. The Bloom filter is what makes the read-write trade-off tolerable.

When to choose each

B-trees win on read-heavy workloads with point lookups and range scans (reporting, OLAP, transactional banking), mixed read-write where predictable latency matters more than peak throughput, and when you need strong consistency and ACID transactions (PostgreSQL is B-tree based for a reason). LSM trees win on write-heavy workloads (event streams, time-series, message storage, logging), when write throughput is the bottleneck, and when you're sharding across many nodes (Cassandra, RocksDB). The canonical production answer: use PostgreSQL (B-tree) for your application database, Cassandra or RocksDB (LSM) for your append-heavy event or message store.

See it in action

← Back to all articles