Reqflow

CAP Theorem

When a network partition happens, you have to pick: consistency or availability. You can't have both.

First time reading this? Start here

Plain English: when the network breaks between two servers holding the same data, your system has to choose: answer with data that may be out of date, or refuse to answer until the servers can talk again. It can't promise an answer that is both always there and always current.

What it is

A statement about distributed systems with replicated state. Given three desirable properties (Consistency, where every read sees the latest write; Availability, where every request gets a non-error response; and Partition tolerance, where the system keeps working despite network splits), you can guarantee at most two at the same time. Since partitions WILL happen in any real distributed system, the real choice is between C and A when a partition occurs.

The problem it solves

Forces an honest conversation about what your system should do when a network split occurs. There's no magic. Either you refuse writes on one side of the partition (CP, choosing consistency) or you accept divergent writes you'll have to reconcile later (AP, choosing availability). Pretending you have all three is a recipe for surprise outages or silent data corruption.
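
The two options above can be sketched as a toy two-replica store. This is a minimal illustration, not how any real database is built: `TwoNodeStore`, `Unavailable`, and the rule that node A is the side allowed to keep writing are all assumptions made for the example.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request instead of risking divergence."""


class TwoNodeStore:
    """Toy model of two replicas, A and B. All names here are hypothetical."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"A": 2, "B": 2}          # both replicas start in sync
        self.partitioned = False
        self.unreplicated = {"A": 0, "B": 0}    # writes accepted but not yet shared

    def write(self, node: str, value: int) -> None:
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            self.values["A"] = self.values["B"] = value
            return
        if self.mode == "CP" and node != "A":
            # Assumption: A is the side allowed to keep writing; B refuses.
            raise Unavailable(f"node {node} cannot reach the primary")
        # CP on the primary side, or AP on either side: accept locally.
        self.values[node] = value
        self.unreplicated[node] += 1

    def conflicting_writes(self) -> bool:
        # True only when both sides accepted writes during the partition.
        return all(count > 0 for count in self.unreplicated.values())
```

In CP mode a write to B during the partition raises `Unavailable`, so only one side ever changes. In AP mode both writes succeed and `conflicting_writes()` becomes true, which is the reconciliation work you have signed up for. A fuller CP model would also have B refuse reads, since its copy may be stale.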

How it works

CAP isn't an algorithm; it's a constraint, so here's what the choice looks like in practice. A leader-based distributed store like etcd or ZooKeeper is CP: if the leader cannot reach a quorum of members, writes fail rather than diverge. Spanner is also CP in the strict sense, but its Paxos replication, TrueTime, and Google's private network make the unavailable windows so short that most apps treat it as 'C with effectively high A'. An eventually-consistent system (Cassandra at default consistency, Dynamo) is AP: writes succeed on whichever replicas are reachable, and reads may show stale data until the replicas converge. (Single-node Postgres isn't really in scope here, since without replication there's no partition to choose against.)
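
One way to picture the difference: consensus-based stores such as etcd and ZooKeeper commit a write only when a majority of members can acknowledge it, while an AP store accepts a write if any replica is reachable. The sketch below encodes just that rule; the function names are hypothetical and it ignores leader election, timeouts, and everything else a real system does.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A strict majority of the cluster is reachable (counting this node)."""
    return reachable > cluster_size // 2


def cp_write(reachable: int, cluster_size: int) -> str:
    # CP: without a majority, reject the write rather than risk divergence.
    return "committed" if has_quorum(reachable, cluster_size) else "rejected"


def ap_write(reachable: int) -> str:
    # AP: any reachable replica may accept; replicas converge later.
    return "accepted" if reachable >= 1 else "rejected"
```

With a 5-node cluster split 3/2, `cp_write(3, 5)` commits on the majority side and `cp_write(2, 5)` is rejected on the minority side, whereas `ap_write(2)` is accepted on either side.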

Why use it

  • Forces explicit choices about consistency vs availability
  • Names the tradeoff clearly: no architecture can promise all three under partition

What it costs you

  • Often misunderstood, since CAP applies only during partitions, not normal operation
  • Most systems are tunable (per-query, per-table), and CAP is a binary that doesn't capture this nuance
  • PACELC is a more complete framing: it adds 'when there's no partition, choose Latency or Consistency'
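
The "else" half of PACELC can be shown with simple arithmetic. This is a rough sketch under stated assumptions: `write_latency_ms` is a hypothetical helper, synchronous replication is modelled as waiting for the slowest replica round trip, and asynchronous replication as waiting for none.

```python
def write_latency_ms(local_ms: float, replica_rtts_ms: list[float],
                     synchronous: bool) -> float:
    """Latency the client observes for one write (toy model)."""
    if synchronous:
        # Consistency first: acknowledge only after the slowest replica confirms.
        return local_ms + max(replica_rtts_ms, default=0.0)
    # Latency first: acknowledge after the local write; replicas catch up later.
    return local_ms
```

With a 1 ms local write and replicas 2 ms and 70 ms away, the synchronous path costs 71 ms and the asynchronous path 1 ms, with no partition anywhere in sight.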

Where it shows up in our architectures

  • WhatsApp

    Cassandra for the message store is AP: during a partition, writes succeed on the reachable replicas; reads may briefly miss recent messages on a recovering node

  • Payment Gateway

    Postgres ledger is CP: money writes go through a single primary, so a partition that cuts off the primary means no writes (the alternative, divergent ledgers, is worse than downtime)

  • Distributed Cache

    Caches are typically AP: under partition you'd rather serve a slightly stale value than fail the read

Gotchas

  • CAP only applies during a partition. The rest of the time you can have both consistency and availability; partitions are the edge case the theorem is talking about.
  • PACELC is the better framing: even without a partition, you trade Latency for Consistency (sync replication vs async).
  • 'CA' systems don't exist in distributed systems, because partitions are inevitable. Anything claiming CA has simply not met a partition yet; when one arrives it will behave as CP or AP, whether or not anyone chose.
  • Per-query tunability (Cassandra's consistency levels, DynamoDB's strongly-consistent reads) means most real systems aren't purely C or A; they're configurable per access.
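
The per-query tunability in the last point is often reasoned about with the overlap rule R + W > N: if the replicas a read must contact plus the replicas a write must reach exceed the replication factor, every read set overlaps the latest write set. The sketch below is a simplification that models only three Cassandra-style levels (ONE, QUORUM, ALL); the helper names are hypothetical and not part of any driver API.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a simplified consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    # Overlap rule: R + W > N guarantees the read touches an up-to-date replica.
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

At replication factor 3, QUORUM reads with QUORUM writes overlap (2 + 2 > 3), while ONE with ONE does not (1 + 1 is not greater than 3), which is why the same cluster can lean consistent for one query and available for the next.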

When this went wrong in production

GitHub 24-hour degradation · 2018

A 43-second network partition triggered more than 24 hours of degraded service.

A 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data center led the automated failover tooling to promote MySQL primaries in the US West Coast data center. When connectivity returned, the East Coast databases held a brief window of writes that had never replicated west, and the West Coast databases had since accepted new writes, so both sides contained data the other lacked. GitHub chose consistency over availability: the service ran degraded for just over 24 hours while engineers restored and reconciled the diverged data. The lesson: CAP isn't a textbook curiosity. When the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.
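
To make "reconciliation strategy expressed in code" concrete, here is a generic sketch of one common technique, version vectors, which can tell a stale copy apart from a true conflict. This is not what GitHub did during the incident; the function name and the `east`/`west` counters are assumptions for illustration only.

```python
def compare_versions(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two version vectors mapping node name -> write counter."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Each side saw writes the other missed: needs a merge or a human.
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"
```

If one side is strictly ahead, the other can simply catch up. A `"conflict"` result, such as `{"east": 3, "west": 1}` against `{"east": 2, "west": 2}`, is the case where both regions accepted writes, and what you do next (merge, last-write-wins, manual repair) is the policy decision the paragraph above describes.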

Interview angle

Interviewers use CAP to probe whether you understand the real trade-off: during a network partition, do you keep serving potentially stale data (AP) or do you refuse requests until you're consistent (CP)? The key is to name the actual system you'd choose (Cassandra for AP, ZooKeeper for CP) and explain why that matches the business requirement. Candidates lose points by saying 'you can have all three': you can't, and that answer signals you memorized the acronym without understanding the partition scenario.
