← Concepts
Distributed Systems·3 min read

CAP Theorem

When a network partition happens, you have to pick: consistency or availability. You can't have both.

First time reading this? Start here

Plain English: when the network breaks between two servers holding the same data, your system has to choose: return possibly-wrong data, or return nothing at all. You can't have both.

Used in:WhatsAppPayment GatewayDistributed Cache
What it is

A statement about distributed systems with replicated state. Given three desirable properties (Consistency, where every read sees the latest write; Availability, where every request gets a non-error response; and Partition tolerance, where the system keeps working despite network splits), you can guarantee at most two at the same time. Since partitions WILL happen in any real distributed system, the real choice is between C and A when a partition occurs.

The problem it solves

Forces an honest conversation about what your system should do when a network split occurs. There's no magic. Either you refuse writes on one side of the partition (CP, choosing consistency) or you accept divergent writes you'll have to reconcile later (AP, choosing availability). Pretending you have all three is a recipe for surprise outages or silent data corruption.

How it works

CAP isn't an algorithm; it's a constraint, so here's what the choice looks like in practice. A leader-based distributed store like etcd or ZooKeeper is CP: if the leader is unreachable, writes fail rather than diverge. Spanner is also CP in the strict sense, but its global Paxos + TrueTime infrastructure makes the unavailable windows so short that most apps treat it as 'C with effectively high A'. An eventually-consistent system (Cassandra at default consistency, Dynamo) is AP: writes succeed on whichever replica is reachable; reads may show stale data until convergence. (Single-node Postgres isn't really in scope here, since without replication there's no partition to choose against.)

Why use it
What it costs you
Where it shows up in our architectures
Gotchas
When this went wrong in production

GitHub 24-hour partition · 2018

Postmortem ↗

A 43-second network partition triggered 24 hours of data inconsistency.

A 43-second network partition between GitHub's US-East and US-West data centers caused MySQL clusters in both regions to elect themselves primary (split-brain). When the partition healed, both regions had accepted writes and now had divergent state. GitHub chose consistency over availability: they took the service degraded for 24+ hours while they manually reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. When the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.

Your notes

Private to you