Replication

Keep copies of your data on multiple nodes so you don't lose it (or your ability to serve) when one dies.

Plain English: keep the same data on multiple machines. If one dies, the others still have it. If one is slow, you read from another. The price you pay: keeping the copies in sync.

Used in: URL Shortener, WhatsApp, Distributed Cache

What it is

Maintaining synchronized copies of the same data across multiple nodes. Common topologies: leader-follower (one writer, many readers), multi-leader (multiple writers, conflict resolution needed), leaderless (any node accepts writes, quorums coordinate).

The problem it solves

Disks fail. Datacenters lose power. Without copies, that's data loss. Replication also lets you scale reads by serving them from followers, and keep serving during a leader outage by promoting a follower.
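
As a rough illustration of read scaling and follower promotion, here is a minimal in-memory sketch. The `Node` and `Cluster` classes and their methods are hypothetical names invented for this example, not any database's API; replication is modeled as asynchronous log catch-up, and promotion simply picks the follower with the longest log.

```python
class Node:
    """A hypothetical replica: an ordered change log plus the resulting data."""

    def __init__(self, name):
        self.name = name
        self.log = []    # ordered (key, value) changes applied so far
        self.data = {}   # current state derived from the log

    def apply(self, entry):
        key, value = entry
        self.log.append(entry)
        self.data[key] = value


class Cluster:
    """Leader-follower sketch: one writer, many readers (assumed design)."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = list(followers)

    def write(self, key, value):
        # All writes go through the single leader.
        self.leader.apply((key, value))

    def replicate(self, follower):
        # Asynchronous catch-up: ship only the entries the follower lacks.
        for entry in self.leader.log[len(follower.log):]:
            follower.apply(entry)

    def read(self, key, node=None):
        # Reads may be served by any node; a lagging follower can be stale.
        if node is None:
            node = self.leader
        return node.data.get(key)

    def promote(self):
        # Leader is assumed dead: promote the most caught-up follower.
        new_leader = max(self.followers, key=lambda f: len(f.log))
        self.followers.remove(new_leader)
        self.leader = new_leader
        return new_leader


a, b, c = Node("a"), Node("b"), Node("c")
cluster = Cluster(a, [b, c])
cluster.write("x", 1)
cluster.replicate(b)
cluster.replicate(c)
cluster.write("x", 2)
cluster.replicate(b)          # c has not caught up yet
stale = cluster.read("x", node=c)
new_leader = cluster.promote()
```

In this sketch, any write the old leader accepted but had not yet shipped to a follower is lost on promotion. That is the trade-off asynchronous replication makes for lower write latency.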

How it works

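Leader-follower: writes go to the leader, which streams changes to followers, either synchronously (the leader waits for follower acknowledgment) or asynchronously (it does not). Reads can go to any node, though an asynchronous follower may return stale data. On leader failure, a follower is promoted. Multi-leader: writes go to any leader and changes replicate between them, so concurrent writes to the same data will conflict and must be resolved. Leaderless (Dynamo-style): each key is stored on N replicas; a write succeeds once W replicas acknowledge it, and a read queries R replicas. Choosing R + W > N makes every read quorum overlap every write quorum, so a read reaches at least one replica holding the latest acknowledged write.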
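
The leaderless quorum rule can be sketched in a few lines. This is an illustrative in-memory model, not Dynamo's actual implementation: `Replica` and `QuorumStore` are hypothetical names, W is the number of write acknowledgments required, and versions are plain integers supplied by the caller (real systems typically use timestamps or vector clocks).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node's local copy: key -> (version, value)."""
    store: dict = field(default_factory=dict)
    up: bool = True


class QuorumStore:
    """Hypothetical leaderless store with N replicas and R/W quorums."""

    def __init__(self, n, w, r):
        if r + w <= n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w = w
        self.r = r

    def write(self, key, value, version):
        # Send the write to every reachable replica and count acknowledgments.
        acks = 0
        for rep in self.replicas:
            if rep.up:
                rep.store[key] = (version, value)
                acks += 1
        if acks < self.w:
            # Simplification: a failed write is not rolled back here.
            raise RuntimeError("write quorum not reached")

    def read(self, key):
        # Query the first R reachable replicas and keep the newest version.
        responses = [rep.store.get(key) for rep in self.replicas if rep.up][: self.r]
        if len(responses) < self.r:
            raise RuntimeError("read quorum not reached")
        found = [resp for resp in responses if resp is not None]
        if not found:
            return None
        return max(found, key=lambda vv: vv[0])[1]


store = QuorumStore(n=3, w=2, r=2)
store.replicas[0].up = False      # replica 0 misses the write
store.write("user:1", "alice", version=1)
store.replicas[0].up = True       # replica 0 returns, still stale
store.replicas[2].up = False      # a different replica fails
value = store.read("user:1")      # quorum {0, 1} still includes a fresh copy
```

With N=3, W=2, R=2, the read quorum contains one stale replica and one fresh one, and picking the highest version returns the acknowledged write. With W=1 and R=1 the constructor rejects the configuration, because a read could then land entirely on replicas that never saw the write.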

When this went wrong in production

AWS S3 us-east-1 melts the internet · 2017

One mistyped input to a routine S3 maintenance command disrupted S3 in us-east-1, and much of the internet with it, for about 4 hours.

During a debugging session, an engineer ran a command meant to remove a small number of servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and the placement subsystem. With the index unavailable, every read started failing. The failure cascaded: AWS services that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Recovery took 4+ hours because the index subsystem hadn't been restarted at scale in years, so the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, and your critical recovery paths need to be exercised regularly so they don't atrophy.

GitLab database deletion · 2017

An engineer ran rm -rf on the wrong database during a 1am incident response.

At 1am, GitLab's secondary database was lagging badly. An engineer trying to fix replication accidentally ran `rm -rf` on the primary database directory instead of the secondary, deleting production data. GitLab had five backup mechanisms in place (snapshots, dumps, replicas, and others). Four of them were silently broken or empty; the fifth held a 6-hour-old backup. Restoring from it lost about 6 hours of data, affecting roughly 5,000 projects, 5,000 comments, and 700 new user accounts. The lesson: backups that aren't tested aren't backups. Restore drills are not optional; they're the only thing that proves your backup strategy works.

GitHub 24-hour partition · 2018

A 43-second network partition triggered 24 hours of data inconsistency.

A 43-second network partition between GitHub's US-East and US-West data centers left MySQL clusters in both regions acting as primary (split-brain). By the time the partition healed, both regions had accepted writes and their state had diverged. GitHub chose consistency over availability: it ran the service in a degraded state for 24+ hours while engineers manually reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. By the time the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy is your CAP choice expressed in code.
