Distributed Cache: System Design

Requirements & API: Distributed Cache

What an interviewer expects you to nail down before drawing a single box.

Functional

•GET/SET/DELETE a key with sub-millisecond latency, routing to the node that owns the key's hash range.
•Distribute keys across nodes via a consistent-hash ring so adding/removing a node remaps only ~1/N keys.
•Replicate each key range to a neighbor so a single node loss doesn't lose data.
•Fall through to the origin DB on a miss and back-fill the cache (cache-aside / read-through).

Non-functional

•Sub-ms reads/writes. The whole point of the cache is to keep traffic off the origin DB.
•Elastic membership: nodes join/leave and clients learn the new ring topology within seconds.
•Tolerate hot keys: a single celebrity key can get 100× average traffic and must not overwhelm one node.
•Availability over strong consistency: it's a cache; brief staleness is acceptable, the origin DB is the truth.

API contract

get(key) → value | nil

Client hashes the key locally and talks directly to the owning node.

set(key, value, ttl?) → ok

Written to the owner and replicated to its neighbor.

delete(key) → ok

internal: ring topology (node → hash range) via ZooKeeper/etcd watch

Clients subscribe to membership changes.
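To make the membership subscription concrete, here is a minimal sketch of the pattern. MembershipStore is a hypothetical in-memory stand-in for the coordination service; it is not the ZooKeeper or etcd client API, and the version counter and callback signature are assumptions made for illustration. The point it shows is that every client keeps a local copy of the ring membership and only moves forward in version.

```python
class MembershipStore:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper/etcd API)."""

    def __init__(self):
        self._version = 0
        self._nodes = frozenset()
        self._watchers = []

    def watch(self, callback):
        """Register a callback and deliver the current state right away."""
        self._watchers.append(callback)
        callback(self._version, self._nodes)

    def update(self, nodes):
        """Publish a new membership list to every watcher."""
        self._version += 1
        self._nodes = frozenset(nodes)
        for callback in self._watchers:
            callback(self._version, self._nodes)


class RingView:
    """Client-side copy of ring membership; ignores stale or duplicate updates."""

    def __init__(self):
        self.version = -1
        self.nodes = frozenset()

    def on_change(self, version, nodes):
        if version > self.version:
            self.version = version
            self.nodes = nodes


store = MembershipStore()
client_a, client_b = RingView(), RingView()
store.watch(client_a.on_change)
store.update(["cache-1", "cache-2"])
store.watch(client_b.on_change)              # a late joiner still gets the current ring
store.update(["cache-1", "cache-2", "cache-3"])
```

After the last update both clients hold the same version and the same node set. In a real deployment the delivery is asynchronous, so two clients can briefly hold different versions; the version check is what keeps a delayed notification from rolling a client back.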

About Distributed Cache

Almost every fast website is fast because it rarely touches its database. It keeps the answers it needs in a distributed cache, a fleet of in-memory servers that return values in well under a millisecond. The job of the design is to spread billions of keys across those servers so no single one is overloaded, and to keep working smoothly even as servers join and leave.

Here is the whole thing in plain terms. When your app server wants a key, it hashes the key locally to figure out which cache node owns it, then connects straight to that node, with no proxy in between. On a hit the node answers from memory. On a miss the node falls through to the origin database, fetches the value, stores a copy with a TTL, and returns it, so the next read for that key is fast. Each node also replicates its keys to a neighbor on the ring, so losing a single node doesn't lose its cached data.
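The hit/miss path on a single node can be sketched as follows. This is an illustration under stated assumptions, not a production node: load_from_origin is a hypothetical callable standing in for the origin database query, the 60-second default TTL is arbitrary, and replication to the neighbor is left out to keep the read path visible.

```python
import time


class CacheNode:
    """One in-memory cache node that reads through to the origin on a miss."""

    def __init__(self, load_from_origin, default_ttl=60.0, clock=time.monotonic):
        self._load = load_from_origin   # hypothetical: key -> value, or None if absent
        self._ttl = default_ttl         # assumed default, in seconds
        self._clock = clock             # injectable so expiry can be tested
        self._store = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if self._clock() < expires_at:
                return value            # hit: served from memory
            del self._store[key]        # expired entry counts as a miss
        value = self._load(key)         # miss: fall through to the origin
        if value is not None:
            self.set(key, value)        # back-fill so the next read is a hit
        return value

    def set(self, key, value, ttl=None):
        lifetime = self._ttl if ttl is None else ttl
        self._store[key] = (value, self._clock() + lifetime)
        return "ok"

    def delete(self, key):
        self._store.pop(key, None)
        return "ok"
```

A node built as CacheNode(lambda key: db.get(key)), where db is whatever origin client you have, calls the origin once per key per TTL window. Keys that are absent from the origin are not cached here, so repeated reads of a missing key keep reaching the database; caching a short-lived "not found" marker is a common refinement.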

The clever part is how keys are assigned, using consistent hashing. Picture the hash space as a clock face and each node sitting at a position on it. A key belongs to the next node clockwise from the key's own hash position. The payoff is what happens when you add or remove a node: only the keys in one arc move, about one in N of them, while everything else keeps its current owner. A plain hash(key) mod N scheme would reassign almost every key when N changes (going from N to N+1 nodes moves roughly N/(N+1) of them), so nearly every lookup would miss and the cache would be effectively empty all at once.
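The clock-face picture translates into a sorted list of positions and a binary search. The sketch below is one way to write it, with a few assumptions to note: MD5 is used only as a stable, well-spread hash; the node names are made up; and each node is placed at several positions (virtual nodes, the vnodes parameter), which the text above does not mention but which evens out arc sizes. Setting vnodes=1 gives the one-position-per-node picture described above.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 for spread, not security)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """A key belongs to the first node position clockwise from the key's hash."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes        # assumption: virtual nodes per physical node
        self._positions = []        # sorted ring positions
        self._owner = {}            # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            pos = stable_hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def owner(self, key):
        idx = bisect.bisect_right(self._positions, stable_hash(key))
        # Wrap around past the largest position back to the start of the ring.
        return self._owner[self._positions[idx % len(self._positions)]]


def modulo_owner(key, nodes):
    """The naive scheme, for comparison: hash(key) mod N."""
    return nodes[stable_hash(key) % len(nodes)]


def moved_fraction(owner_before, owner_after, keys):
    """Fraction of keys whose owner differs between two assignment functions."""
    return sum(owner_before(k) != owner_after(k) for k in keys) / len(keys)


def compare_schemes(node_count=10, key_count=10_000):
    """Add one node and report (ring_moved, modulo_moved) as fractions."""
    nodes = [f"cache-{i}" for i in range(node_count)]
    grown = nodes + [f"cache-{node_count}"]
    keys = [f"user:{i}" for i in range(key_count)]
    ring_before, ring_after = HashRing(nodes), HashRing(grown)
    ring_moved = moved_fraction(ring_before.owner, ring_after.owner, keys)
    modulo_moved = moved_fraction(
        lambda k: modulo_owner(k, nodes), lambda k: modulo_owner(k, grown), keys
    )
    return ring_moved, modulo_moved
```

Calling compare_schemes() grows a 10-node cluster to 11. The ring should move a fraction of keys near 1/11, and every moved key lands on the new node; the modulo scheme should move a fraction near 10/11. Exact numbers depend on the hash and the key set.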

There is one failure mode this design can't hash its way out of: the hot key. Consistent hashing spreads keys evenly on average, but a single celebrity profile can draw 100 times the normal traffic, and it all lands on whichever node owns it until that node melts. The fix is to detect the hot key, replicate it across several nodes, and have clients read from a random copy. Separately, a coordination service like ZooKeeper or etcd pushes ring changes to every client, so clients converge on the same view of who owns a key within seconds of a membership change instead of drifting apart.
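The detect, replicate, and read-from-a-random-copy steps can be sketched on the client side like this. Everything specific here is an assumption for illustration: the per-window read counter as the detection signal, the threshold of 1000 reads, three copies for a hot key, and placing the extra copies on the next nodes clockwise from the owner (one ring position per node, to keep the code short).

```python
import hashlib
import random
from collections import Counter


def ring_position(value: str) -> int:
    """Stable 64-bit ring position for a node name or key."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def successors(key, nodes, count):
    """Return up to `count` distinct nodes clockwise from the key's position."""
    ring = sorted(nodes, key=ring_position)
    key_pos = ring_position(key)
    # First node past the key's position; wrap to index 0 if there is none.
    start = next((i for i, n in enumerate(ring) if ring_position(n) > key_pos), 0)
    return [ring[(start + i) % len(ring)] for i in range(min(count, len(ring)))]


class HotKeyRouter:
    """Client-side routing that fans reads for hot keys out over several nodes."""

    def __init__(self, nodes, hot_threshold=1000, hot_copies=3, rng=None):
        self.nodes = list(nodes)
        self.hot_threshold = hot_threshold   # assumed: reads per window that mark a key hot
        self.hot_copies = hot_copies         # assumed: copies kept for a hot key
        self.rng = rng or random.Random()
        self.hits = Counter()                # reads seen in the current window

    def is_hot(self, key):
        return self.hits[key] >= self.hot_threshold

    def copies_for(self, key):
        """Nodes that should hold this key right now (owner first)."""
        count = self.hot_copies if self.is_hot(key) else 1
        return successors(key, self.nodes, count)

    def node_for_read(self, key):
        self.hits[key] += 1
        return self.rng.choice(self.copies_for(key))

    def reset_window(self):
        """Call periodically so keys cool down when traffic drops."""
        self.hits.clear()
```

Writes and deletes for a hot key would go to every node in copies_for(key) so the copies do not diverge. A copy that has not received the value yet simply misses and reads through to the origin, which is acceptable for a cache. Counting per client is a simplification; detection on the cache nodes themselves sees the total traffic for a key.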

This system teaches consistent hashing and why it minimizes remapping, replication for single-node fault tolerance, cache-aside / read-through behavior against an origin database, the hot-key problem and how to spread the load, and why a consensus store is used to broadcast ring membership.