Reliability · 3 min read

Replication

Keep copies of your data on multiple nodes so you don't lose it (or your ability to serve) when one dies.

One leader takes the writes and copies them to read-only followers. You scale reads by adding followers, and you get redundancy if the leader dies. The cost: a follower can briefly serve stale data during the replication lag.
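A minimal sketch of that flow, assuming an in-memory store and a queue standing in for the network. The class and method names (`Leader`, `Follower`, `apply_pending`) are made up for illustration and do not belong to any particular database.

```python
from collections import deque


class Follower:
    """Read-only replica that applies changes in the order the leader sent them."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # changes received but not yet applied (the lag)

    def receive(self, key, value):
        self.pending.append((key, value))

    def apply_pending(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Takes all writes and forwards each one to every follower."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        for follower in self.followers:
            follower.receive(key, value)  # async: do not wait for the follower to apply


followers = [Follower(), Follower()]
leader = Leader(followers)

leader.write("x", 1)
stale = followers[0].read("x")  # follower has not applied the change yet
for f in followers:
    f.apply_pending()           # replication catches up
fresh = followers[0].read("x")
print(stale, fresh)             # None 1
```

The read between `write` and `apply_pending` is the stale read: the leader already has x=1, the follower does not.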

First time reading this? Start here

Plain English: keep the same data on multiple machines. If one dies, the others still have it. If one is slow, you read from another. The price you pay: keeping the copies in sync.

Used in: URL Shortener, WhatsApp, Distributed Cache

What it is

Maintaining synchronized copies of the same data across multiple nodes. Common topologies: leader-follower (one writer, many readers), multi-leader (multiple writers, conflict resolution needed), leaderless (any node accepts writes, quorums coordinate).

The problem it solves

Disks fail. Datacenters lose power. Without copies, that's data loss. Replication also lets you scale reads by serving them from followers, and keep serving during a leader outage by promoting a follower.

How it works

Leader-follower: writes go to the leader, which streams changes to followers (sync or async). Reads can go anywhere. On leader failure, a follower is promoted. Multi-leader: writes go to any leader and changes replicate between them, so conflicting concurrent writes are inevitable. Leaderless (Dynamo-style): each key lives on N replicas; a write waits for acknowledgements from W of them and a read queries R of them. Choosing R+W>N makes every read quorum overlap the most recent write quorum, which is what lets a read see the latest acknowledged write.
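To make the quorum condition concrete, here is a small sketch, assuming N is the number of replicas holding a key, W the number that must accept a write, and R the number a read consults. The `Replica`, `quorum_write`, and `quorum_read` names are hypothetical, and versions are plain integers rather than real vector clocks. The write deliberately lands on the first W replicas and the read hits the last R, which is the worst case for overlap.

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def put(self, key, version, value):
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def get(self, key):
        return self.store.get(key)


def quorum_write(replicas, w, key, version, value):
    """Write to the first w replicas (stand-in for 'the first w to acknowledge')."""
    for replica in replicas[:w]:
        replica.put(key, version, value)


def quorum_read(replicas, r, key):
    """Read from the last r replicas and return the value with the highest version."""
    answers = [rep.get(key) for rep in replicas[-r:]]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])[1]


N = 3
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, 3, "x", version=1, value="old")  # every replica has version 1
quorum_write(replicas, 2, "x", version=2, value="new")  # W=2: the third replica misses it

overlapping = quorum_read(replicas, 2, "x")  # R=2, so R+W=4 > N
too_small = quorum_read(replicas, 1, "x")    # R=1, so R+W=3 = N
print(overlapping, too_small)                # new old
```

With R=2 at least one replica in the read set also took the write, so the newest version wins. With R=1 the read can land entirely on the replica that missed it.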

Why use it

Durability: N replicas means N independent failures before data loss
Read scaling: follower reads multiply read capacity
Failover: promote a follower instead of taking an outage
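One way to picture the failover step: promote the healthy follower that has applied the most of the leader's log. This is a sketch under assumed names (`pick_new_leader`, `applied_position`, the follower labels); real systems add leader election, fencing, and client redirection on top.

```python
def pick_new_leader(followers):
    """Return the name of the healthy follower with the highest applied log position."""
    healthy = {name: f for name, f in followers.items() if f["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy follower to promote")
    return max(healthy, key=lambda name: healthy[name]["applied_position"])


# Hypothetical cluster state at the moment the leader dies.
followers = {
    "follower-1": {"healthy": True, "applied_position": 1042},
    "follower-2": {"healthy": True, "applied_position": 1050},
    "follower-3": {"healthy": False, "applied_position": 1051},
}
leader_position = 1051  # last log position the failed leader had written

new_leader = pick_new_leader(followers)
lost_writes = leader_position - followers[new_leader]["applied_position"]
print(new_leader, lost_writes)  # follower-2 1
```

The nonzero `lost_writes` is the gap an asynchronously replicated follower can have when it takes over.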

What it costs you

Sync replication is slow (every write waits for replicas); async can lose recently acknowledged writes on leader failure
Read-your-writes guarantees are subtle, since followers may lag the leader
Multi-leader replication produces conflicts that have to be resolved (CRDTs, last-write-wins, or app logic)
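Of the conflict-resolution options listed, last-write-wins is the simplest to sketch. The example below assumes each write carries a timestamp and the id of the leader that accepted it; the `Versioned` type and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # hypothetical wall-clock time of the write
    node_id: str      # tie-breaker so every leader picks the same winner


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the larger (timestamp, node_id); the other is discarded."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


us = Versioned("title from US", timestamp=100.0, node_id="us-east")
eu = Versioned("title from EU", timestamp=100.5, node_id="eu-west")

winner = lww_merge(us, eu)
print(winner.value)  # title from EU
```

The merge gives the same answer in either argument order, so all leaders converge. The trade-off is that the losing write is silently dropped, and the result is only as trustworthy as the clocks behind the timestamps.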

Where it shows up in our architectures

URL Shortener →
Postgres read replicas absorb the heavy read load
WhatsApp →
Cassandra replication factor 3 survives single replica loss with no downtime
Distributed Cache →
Each cache node replicates to a neighbor; loss of a node doesn't lose data

Gotchas

Sync vs async replication is the latency-vs-durability dial. Pick consciously per workload: sync for ledgers, async for analytics.
Follower lag is real. If your app needs read-your-writes, route reads to the leader for the affected user.
Multi-leader replication is harder than it looks. Default to single-leader unless you have a specific multi-region or offline-write requirement.
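The read-your-writes routing above can be sketched as a router that pins a user's reads to the leader for a short window after that user writes. `ReadRouter`, `pin_seconds`, and the injected clock are assumptions for illustration; the window should be at least as long as the follower lag you expect.

```python
import time


class ReadRouter:
    """Send a user's reads to the leader for a short window after they write."""

    def __init__(self, pin_seconds, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed upper bound on follower lag
        self.clock = clock
        self.last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "leader"
        return "follower"


now = [0.0]  # fake clock so the example is deterministic
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])

router.record_write("alice")
now[0] = 1.0
right_after = router.target_for_read("alice")  # inside the window
other_user = router.target_for_read("bob")     # never wrote
now[0] = 10.0
later = router.target_for_read("alice")        # window has expired
print(right_after, other_user, later)          # leader follower follower
```

Only the user who just wrote pays the cost of a leader read; everyone else keeps reading from followers.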

When this went wrong in production

AWS S3 us-east-1 melts the internet · 2017

Postmortem ↗

One typo in a routine S3 maintenance command took down half the internet for 4 hours.

An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. S3 lost the index → every read started failing. Cascading failure: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.

GitLab database deletion · 2017

Postmortem ↗

An engineer ran rm -rf on the wrong database during a 1am incident response.

At 1am, GitLab's secondary database was lagging badly. An engineer trying to fix replication accidentally ran `rm -rf` on the PRIMARY database directory instead of the secondary. Production data, gone. They had FIVE backup mechanisms in place: snapshots, dumps, replicas, etc. Four of them were silently broken or empty. The fifth had a 6-hour-old backup. They lost about 6 hours of data: roughly 5,000 projects, 5,000 comments, and 700 new user accounts. The lesson: backups that aren't tested aren't backups. Restore drills are not optional. They're the only thing that proves your backup strategy works.

GitHub 24-hour partition · 2018

Postmortem ↗

A 43-second network partition triggered 24 hours of data inconsistency.

A 43-second network partition between GitHub's US-East and US-West data centers caused MySQL clusters in both regions to elect themselves primary (split-brain). When the partition healed, both regions had accepted writes and now had divergent state. GitHub chose consistency over availability: they ran the service in a degraded state for 24+ hours while they manually reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. When the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.

Cloudflare Workers KV stale reads for 35 minutes · 2023

Postmortem ↗

A replication topology change made Workers KV return data that was hours old globally.

Cloudflare Workers KV is a globally distributed key-value store built on eventual consistency: writes propagate to all edges within roughly 60 seconds. During a maintenance operation, an engineer changed the replication topology, specifically which nodes a region's reads fall back to on cache miss. The change accidentally routed reads for a subset of keys to a secondary tier that had stopped receiving updates. Edge nodes across all regions started serving stale values that were hours old, not seconds old. Feature flags, A/B test configs, and auth tokens stored in KV returned wrong results for 35 minutes. The lesson: eventually consistent systems have a defined propagation bound. Any change to the replication topology must be validated against that bound. Breaking propagation doesn't produce errors; it produces silent staleness that can persist indefinitely.

Azure Active Directory outage: MFA breaks for 14 hours · 2023

Postmortem ↗

A corrupted database update took down Azure AD MFA globally, locking millions of users out of Microsoft services.

In September 2023, a routine Azure Active Directory update introduced a corrupted data entry into the authentication service's configuration store. That store is read on every MFA request, so within minutes of deployment, MFA was failing globally. Services that depend on Azure AD for login, including Microsoft 365, Teams, Azure Portal, and Xbox, all started rejecting multi-factor auth. Because MFA was broken, engineers trying to reach the management plane to roll back had to use break-glass procedures. The update was eventually rolled back, but re-validation and cache clearing across global infrastructure took 14 hours. The lesson: configuration stores are critical path for every request. Changes to them must be validated on live traffic via canary before global rollout. Break-glass procedures must not depend on the service they're trying to fix.

Google Docs deletes documents for 0.001% of users · 2023

Postmortem ↗

A storage migration bug silently deleted the document content for a small fraction of Google Docs users.

During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for a small fraction of users, roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent: Google Drive kept showing the document title and metadata, but opening the document showed blank content. Users didn't realize the content was gone right away, which delayed support tickets. Recovery was partial. Google's cross-region replication had the content, but the deletion had already propagated before the replication lag resolved, making some recent edits unrecoverable. The lesson: data migrations must be zero-destructive. Write, then verify, then delete, with a flag that can be rolled back. Replication protects against node failure, not application-layer bugs that replicate the bug to every replica.
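The "write, then verify, then delete" order from that lesson can be sketched as below. Everything here is hypothetical: the stores are plain dicts, `tombstones` stands in for a reversible soft-delete flag, and the hard delete of the source is assumed to happen only later, after a retention window.

```python
import hashlib


def checksum(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def migrate(doc_id, old_store, new_store, tombstones):
    """Copy one document, verify the copy, and only then soft-delete the source."""
    content = old_store[doc_id]
    new_store[doc_id] = content                            # 1. write
    if checksum(new_store[doc_id]) != checksum(content):   # 2. verify
        del new_store[doc_id]
        return False
    tombstones.add(doc_id)                                 # 3. mark deleted, reversibly
    return True


def rollback(doc_id, tombstones):
    """Undo the soft delete; the source content was never removed."""
    tombstones.discard(doc_id)


old_store = {"doc-1": b"quarterly report"}
new_store = {}
tombstones = set()

ok = migrate("doc-1", old_store, new_store, tombstones)
print(ok, "doc-1" in old_store, "doc-1" in tombstones)  # True True True

rollback("doc-1", tombstones)
print("doc-1" in tombstones)                            # False
```

Because the source bytes stay in place until the retention window ends, a bug in the migration can be undone by clearing the flag instead of restoring from a replica that may already have copied the damage.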

Interview angle

Replication comes up in any discussion of high availability or read scalability. The key question to answer is whether replication is synchronous or asynchronous, because that is the durability vs latency trade-off. Sync replication guarantees no data loss on failover but adds write latency; async is faster but a follower promoted after a crash may be behind. Mention follower lag as a real operational concern and say you'd route read-your-writes queries to the leader. Candidates lose points by adding read replicas without addressing lag and the consistency implications.