Performance·3 min read

Cache Eviction & Write Policies

How a cache decides what to throw out (LRU/LFU/TTL) and how writes propagate to the backing store (through/back/around).

A cache is finite, so when it fills up something must go. LRU evicts whatever was used least recently, FIFO the oldest inserted, LFU the least frequently used. The right policy depends on your access pattern; LRU is the common default.
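
As a rough illustration of the LRU rule described above, here is a minimal sketch built on Python's OrderedDict. The class name LRUCache and its methods are assumptions made for this example, not any particular library's API.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()  # least recently used first, most recent last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # a read counts as a use
        return self._data[key]

    def put(self, key, value):
        """Store a value; return the evicted key, or None if nothing was evicted."""
        if key in self._data:
            self._data.move_to_end(key)
            self._data[key] = value
            return None
        evicted = None
        if len(self._data) >= self.capacity:
            evicted, _ = self._data.popitem(last=False)  # drop the coldest entry
        self._data[key] = value
        return evicted

    def keys(self):
        """Keys ordered from coldest to hottest."""
        return list(self._data)
```

With a capacity of 4, inserting a, b, c, d, reading a, and then inserting e evicts b, because a was refreshed by the read. Removing the move_to_end call in get would turn this into FIFO, since insertion order alone would then decide who goes.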

Plain English: a cache has limited room, so it needs rules for what to evict when full (usually 'kick out whatever was used least recently') and rules for how writes reach the real database (write to both at once, write to cache and flush later, or skip the cache on writes). These choices decide your speed-vs-correctness tradeoff.

Used in: Distributed Cache, URL Shortener, Instagram Feed

What it is

Two related sets of policies. Eviction policies decide which entries to drop when the cache is full or stale: LRU (least recently used), LFU (least frequently used), FIFO, and TTL-based expiry. Write policies decide how a write flows between cache and the source of truth: write-through (both synchronously), write-back/write-behind (cache now, DB async), and write-around (DB only, skip the cache).

The problem it solves

A cache has finite memory, so something must be evicted when it fills. Pick wrong and you evict the hot data you needed. And on writes you must decide whether the cache and database stay in lockstep (safe, slower) or diverge temporarily (fast, riskier). These policies are the knobs that set your cache's hit rate, write latency, and staleness window.

How it works

Eviction:
LRU tracks recency and drops the coldest entry (a good general default, and the most common)
LFU tracks access counts and keeps the popular ones (better for skewed access but slower to adapt)
TTL expires entries after a fixed time regardless of use (the simplest defense against staleness)

Writes:
Write-through updates cache and DB synchronously (consistent, slower writes)
Write-back updates the cache and flushes to the DB asynchronously (fast writes, risk of data loss if the cache dies before flush)
Write-around writes straight to the DB and lets the cache populate lazily on the next read (avoids polluting the cache with write-once data)
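
The three write policies can be compared side by side in a toy model. This is a sketch under stated assumptions: the cache and the database are plain dicts, WritePolicyDemo and its method names are hypothetical, and the flush that a real system would run asynchronously is an explicit call here. Dropping the cached copy on write-around is also an assumption, added so a later read cannot return a stale value.

```python
class WritePolicyDemo:
    """Toy cache in front of a toy database; both are plain dicts."""

    def __init__(self):
        self.cache = {}
        self.db = {}
        self.dirty = set()  # keys written to the cache but not yet flushed

    def write_through(self, key, value):
        # Update both before returning, so they never disagree.
        self.db[key] = value
        self.cache[key] = value

    def write_back(self, key, value):
        # Update only the cache now; the DB catches up on flush().
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # A real system would run this asynchronously, often in batches.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def write_around(self, key, value):
        # Write to the DB only; drop any cached copy so reads do not go stale.
        self.db[key] = value
        self.cache.pop(key, None)

    def read(self, key):
        # On a miss, load from the DB and populate the cache lazily.
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]  # raises KeyError if the key was never written
        self.cache[key] = value
        return value

    def crash_cache(self):
        # Simulate losing the cache node: unflushed write-back data is gone.
        self.cache.clear()
        self.dirty.clear()
```

Calling write_back followed by crash_cache before flush leaves the key absent from db, which is the data-loss risk in miniature. After write_around, the first read misses the cache and pays the database round trip.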

Why use it

LRU is a cheap, effective default that matches most real access patterns
TTL sidesteps most invalidation headaches, and stale-by-a-bounded-window is usually acceptable
Write-back gives very fast writes by batching them to the DB; write-around keeps the cache clean of cold write-once data

What it costs you

Write-back risks data loss: if the cache node dies before flushing, those writes are gone
LFU adapts slowly: a once-popular key with a high count lingers long after it goes cold (needs aging/decay)
TTLs are guesses: too short kills hit rate, too long serves stale data; there's no universally right value
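
To make the aging idea concrete, here is a minimal LFU sketch that halves every access count at a fixed interval so old popularity fades. The class name DecayingLFUCache, the decay_every parameter, and the halving rule are all assumptions chosen for illustration; production caches use more refined schemes.

```python
class DecayingLFUCache:
    """LFU cache whose access counts are halved periodically (simple aging)."""

    def __init__(self, capacity, decay_every=100):
        self.capacity = capacity
        self.decay_every = decay_every  # hypothetical knob: accesses between decays
        self._values = {}
        self._counts = {}
        self._accesses = 0

    def _touch(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        self._accesses += 1
        if self._accesses % self.decay_every == 0:
            # Aging: halve every count so a once-hot key can go cold.
            for k in self._counts:
                self._counts[k] //= 2

    def get(self, key, default=None):
        if key not in self._values:
            return default
        self._touch(key)
        return self._values[key]

    def put(self, key, value):
        """Store a value; return the evicted key, or None if nothing was evicted."""
        evicted = None
        if key not in self._values and len(self._values) >= self.capacity:
            # O(n) scan for the lowest count: fine for a sketch, slow at scale.
            evicted = min(self._counts, key=self._counts.get)
            del self._values[evicted]
            del self._counts[evicted]
        self._values[key] = value
        self._touch(key)
        return evicted
```

Without the halving step, a key that collected thousands of hits last week would outrank everything that is warm today. The decay interval is itself a tuning guess, much like a TTL.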

Where it shows up in our architectures

Distributed Cache: Per-node eviction (LRU/LFU) plus TTLs decide what each ring node keeps; write policy sets whether it's a look-aside or write-through cache
URL Shortener: Cache-aside reads with write-around: new short links are written to Postgres and populate Redis lazily on first redirect
Instagram Feed: Precomputed timelines in Redis carry TTLs and LRU eviction so cold users' feeds get reclaimed for active ones

Gotchas

Reach for LRU first; it's the right default. Only move to LFU if you have a measurably skewed, stable access pattern, and add aging or it never forgets old hot keys.
Write-back is the riskiest write policy: a cache crash before flush silently loses data. Only use it where some loss is acceptable, and back it with replication.
Write-around prevents cache pollution from write-heavy, read-rarely data, but the first read after a write is always a miss. Know that latency cliff exists.
TTL plus LRU together is the pragmatic combo: TTL bounds staleness, LRU bounds memory. Pick the TTL from your real staleness tolerance, not a round number.
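
The TTL plus LRU combination can be sketched in a few lines. This is an illustrative example, not a reference implementation: TTLLRUCache and its clock parameter are assumptions, and expiry is lazy, meaning an expired entry is only removed when it is read or evicted.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """LRU cache whose entries also expire after a fixed time to live."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._clock = clock  # injectable so tests can fake time
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # TTL bounds staleness: expired means gone
            return default
        self._data.move_to_end(key)  # a fresh hit also refreshes recency
        return value

    def put(self, key, value):
        if key in self._data:
            del self._data[key]
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # LRU bounds memory
        self._data[key] = (value, self._clock() + self.ttl)
```

A read does not extend the expiry time here, so the staleness bound holds no matter how hot the key is. Because expiry is lazy, expired entries still occupy slots until touched; a periodic sweep would be one way to reclaim them sooner.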

When this went wrong in production

Slack's 5-hour outage from a cascading cache failure · 2022

A cache misconfiguration caused a load spike that overwhelmed Slack's databases in sequence.

Slack deployed a Memcached configuration change that accidentally reduced the effective cache size. Requests that would have hit cache started hitting the database. The database absorbed the initial surge, but latency crept up. Slower DB responses caused app servers to hold connections longer, exhausting their connection pools. Exhausted pools caused requests to queue. Queued requests timed out and clients retried, amplifying the load. The database load balancer fell over. Slack was effectively down for 5 hours for most users. The lesson: cache and database tiers aren't independent. A hit rate drop of just 5-10 percentage points can multiply database load on a busy system: falling from 99% to 90% hits takes misses from 1% to 10%, which is 10x the database reads. Monitor cache hit rate as a first-class operational metric and have a circuit breaker for cache degradation.
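
The 10x figure follows from how small the miss rate is to begin with. As a back-of-the-envelope check, the helper below (a hypothetical function written for this example) computes how many times more reads reach the database when the hit rate changes, assuming every miss becomes exactly one database read and total traffic stays constant.

```python
def db_load_multiplier(old_hit_rate, new_hit_rate):
    """How many times more reads reach the database after the hit rate changes.

    Assumes each cache miss causes exactly one database read and that
    total request volume is unchanged.
    """
    old_miss = 1.0 - old_hit_rate
    new_miss = 1.0 - new_hit_rate
    if old_miss <= 0:
        raise ValueError("old hit rate must be below 1.0")
    return new_miss / old_miss
```

Going from a 99% to a 90% hit rate gives a multiplier of about 10, while going from 80% to 70% gives about 1.5. The better the cache was doing, the more a small drop hurts, and retries during an incident push real load above what this simple ratio predicts.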

Interview angle

Cache eviction and write policy questions test whether you know the tradeoffs rather than just the names. For eviction, say LRU is the right default, then explain when you'd switch to LFU (stable, heavily skewed access patterns) and why LFU can be slow to adapt. For write policy, say write-through for safety-critical data and explain write-back's data loss risk. Candidates lose points by listing all three write policies without saying which they'd actually use and why.
