Quorum

Require a majority of nodes, floor(N/2)+1, to agree on every operation, so the system stays consistent even when a minority of nodes are down.

Example: with N = 5 replicas, W = 3, and R = 3, W + R = 6 > 5 → guaranteed overlap, reads see the latest write.

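Across N replicas, a write waits for W acks and a read queries R replicas. If W + R > N, the two sets must share at least one node, so a read always sees the newest acknowledged write. Tune W and R to trade write latency against read latency; letting W + R drop to N or below buys speed at the cost of possibly stale reads.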
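
The overlap rule can be checked directly. The sketch below compares the arithmetic test W + R > N with a brute-force enumeration of every possible write and read quorum. The function names are made up for illustration, and the brute-force version is only practical for small N.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Arithmetic test: any write quorum and any read quorum share a node."""
    return w + r > n


def brute_force_overlap(n: int, w: int, r: int) -> bool:
    """Same property, verified by enumerating every pair of quorums (small n only)."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


if __name__ == "__main__":
    n = 5
    for w, r in [(3, 3), (1, 5), (2, 3), (2, 2)]:
        print(f"N={n} W={w} R={r}",
              quorums_overlap(n, w, r),
              brute_force_overlap(n, w, r))
```

With N = 5, the pairs (3, 3) and (1, 5) overlap, while (2, 3) and (2, 2) do not, and both functions agree on every case.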

First time reading this? Start here

Plain English: when you have 5 copies of the data, don't trust any single one. Require at least 3 of them to agree on every read and every write. That way you can lose any 2 machines and the survivors still tell the truth.

Used in: WhatsApp, Distributed Cache

What it is

A voting rule for replicated systems: an operation counts as successful only once enough nodes, typically a majority, agree. Three parameters define it: total replicas (N), write quorum (W), and read quorum (R). The classic rule R+W>N guarantees that every read quorum overlaps the most recent successful write quorum.

The problem it solves

In a distributed system, nodes can fail or be partitioned. If you require all nodes to agree, any single failure halts you. If you require only one, you risk split-brain (two halves of a partition both claiming to be authoritative). A quorum strikes a balance: tolerate the failure of any minority of nodes, up to (N-1)/2 rounded down, while staying consistent.

How it works

On write, send to all N replicas and wait for W acks before returning success. On read, query R replicas; the value with the latest version wins. If R+W>N, every read quorum overlaps every write quorum, so you always see the latest committed value. Raft and Paxos rely on majority quorums; Dynamo and Cassandra expose R and W as tunable consistency levels.
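
A minimal in-memory sketch of the write and read paths described above. Everything here is an assumption made for illustration: the `Replica` class, the integer version numbers, and the `up` flag that stands in for node failure. It is not how any particular database implements quorums, and it ignores concurrency, retries, and rollback of writes that fail to reach W acks.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Replica:
    """One copy of a single key; `version` orders writes (hypothetical scheme)."""
    version: int = 0
    value: Optional[str] = None
    up: bool = True


def quorum_write(replicas: List[Replica], w: int, version: int, value: str) -> bool:
    """Send the write to every reachable replica; succeed only with at least w acks."""
    acks = 0
    for rep in replicas:
        if rep.up:
            if version > rep.version:
                rep.version, rep.value = version, value
            acks += 1
    # Note: replicas that applied the write keep it even if the quorum was missed.
    return acks >= w


def quorum_read(replicas: List[Replica], r: int) -> Optional[Tuple[int, Optional[str]]]:
    """Query r reachable replicas and return the (version, value) with the highest version."""
    reachable = [rep for rep in replicas if rep.up]
    if len(reachable) < r:
        return None  # not enough replicas to form a read quorum
    sample = random.sample(reachable, r)
    newest = max(sample, key=lambda rep: rep.version)
    return newest.version, newest.value


if __name__ == "__main__":
    replicas = [Replica() for _ in range(5)]      # N = 5
    quorum_write(replicas, 3, 1, "a")             # all five replicas ack
    replicas[0].up = replicas[1].up = False       # two replicas go down
    print(quorum_write(replicas, 3, 2, "b"))      # three acks still meet W = 3
    replicas[0].up = replicas[1].up = True        # they return holding stale data
    print(quorum_read(replicas, 3))               # any 3 of 5 include a fresh copy
```

Because only two replicas are stale and R = 3, every possible read set contains at least one replica holding version 2, whichever replicas the read happens to pick.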

Why use it

Tolerates up to (N-1)/2 node failures (rounded down) without losing consistency
Tunable: low R + high W keeps reads cheap and suits read-heavy workloads; high R + low W keeps writes cheap and suits write-heavy workloads
Prevents split-brain, since only one partition can have a majority
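
The split-brain point follows from simple counting: two disjoint groups cannot both contain more than half of the nodes. A small sketch, with hypothetical function names, that reports which sides of a partition could still form a majority:

```python
from typing import List


def majority(n: int) -> int:
    """Smallest group size that is more than half of n nodes."""
    return n // 2 + 1


def sides_with_majority(n: int, partition_sizes: List[int]) -> List[int]:
    """Return the partition sizes that reach a majority of the full cluster."""
    if sum(partition_sizes) > n:
        raise ValueError("partitions cannot contain more nodes than the cluster")
    need = majority(n)
    return [size for size in partition_sizes if size >= need]


if __name__ == "__main__":
    print(sides_with_majority(5, [3, 2]))     # one side can keep serving
    print(sides_with_majority(5, [2, 2, 1]))  # no side has a majority
    print(sides_with_majority(4, [2, 2]))     # an even split leaves no majority
```

The result never holds more than one entry. An empty result corresponds to the case where the cluster has no majority anywhere and must stop accepting quorum operations.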

What it costs you

Every write waits for W acks, giving higher latency than a write acknowledged by a single node
Network partitions can leave you with no majority, and then the system stops accepting operations until one is restored (CP-side of CAP)
Coordination overhead grows with N, so clusters are usually capped at 3, 5, or 7 replicas
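
To make the latency cost concrete: if requests go to all replicas in parallel, the operation completes when the W-th fastest ack arrives. The sketch below assumes that parallel fan-out, and the latency numbers are invented for illustration.

```python
from typing import Sequence


def quorum_latency(ack_latencies_ms: Sequence[float], quorum: int) -> float:
    """Time until `quorum` acks have arrived, given one latency per replica."""
    if not 1 <= quorum <= len(ack_latencies_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The operation finishes when the quorum-th fastest reply comes back.
    return sorted(ack_latencies_ms)[quorum - 1]


if __name__ == "__main__":
    # Hypothetical per-replica ack times: three nearby nodes, two slow ones.
    acks = [4, 6, 9, 35, 120]
    for w in (1, 3, 5):
        print(f"W={w}: {quorum_latency(acks, w)} ms")
```

With these made-up numbers, W = 1 waits 4 ms, W = 3 waits 9 ms, and W = 5 waits 120 ms. A majority quorum hides the slowest replicas, while requiring all N exposes the operation to the slowest one.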

Where it shows up in our architectures

WhatsApp: Cassandra writes to the message store typically use a quorum consistency level
Distributed Cache: ZooKeeper or etcd manages ring membership using quorum-based consensus

Gotchas

Cluster size is usually odd (3, 5, 7): an even N raises the majority size without tolerating any more failures. 5 is the typical production sweet spot and tolerates 2 failures.
R+W>N is the key property. Configuring R=W=floor(N/2)+1 is the smallest symmetric setting that gives you that guarantee; lowering one of the two only works if the other rises to compensate.
Quorum doesn't replace consensus protocols (Raft, Paxos); those add leader election and log replication on top of majority quorums.
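
A short sketch of the sizing arithmetic behind the first two gotchas. The helper name `cluster_row` is made up; the formulas are the majority rule used throughout this page.

```python
from typing import Tuple


def cluster_row(n: int) -> Tuple[int, int, int, bool]:
    """Return (N, majority size, failures tolerated, whether R = W = majority gives R + W > N)."""
    majority = n // 2 + 1
    tolerated = n - majority          # nodes that can fail while a majority survives
    overlap = majority + majority > n
    return n, majority, tolerated, overlap


if __name__ == "__main__":
    print("N  majority  tolerated  R+W>N")
    for n in range(3, 8):
        size, maj, tol, ok = cluster_row(n)
        print(f"{size}  {maj:<8}  {tol:<9}  {ok}")
```

For N = 3 through 7 the majorities are 2, 3, 3, 4, 4 and the tolerated failures are 1, 1, 2, 2, 3. Going from 3 to 4 nodes, or from 5 to 6, adds a machine and a larger quorum but no extra fault tolerance.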

When this went wrong in production

GitHub 24-hour outage · 2018

Postmortem ↗

A 43-second network partition triggered 24 hours of degraded service.

A 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data center caused Orchestrator to fail the MySQL primaries over to the US West Coast data center. When connectivity returned, each coast held writes the other had never replicated, so the clusters had divergent state. GitHub chose consistency over availability: it ran the service in a degraded state for 24+ hours while engineers reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. By the time the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.

Interview angle

Quorum comes up in any discussion of replicated systems with no single leader (Cassandra and other Dynamo-style stores). The interviewer wants you to show you understand R+W>N: when the read and write quorums together exceed N (for example, each touches a majority), they always overlap and you never read stale data. Be ready to say how you'd tune W and R for your workload. Candidates lose points by conflating quorum with "all replicas must respond." That is not quorum; it is fully synchronous replication.
