← Concepts
Sharding·3 min read

Sharding / Data Partitioning

Split one big dataset across N smaller stores so each one stays manageable.

First time reading this? Start here

Plain English: when one database gets too big to fit on one machine, you slice the data into chunks (by user, by region, by some key) and put each chunk on its own machine. Each shard handles just its slice.

Used in:Instagram FeedWhatsAppUber
What it is

Horizontal partitioning of a single logical dataset across multiple physical stores. Each shard holds a slice of the keyspace; queries are routed to the right shard based on a partition key.

The problem it solves

A single database eventually hits a wall on IOPS, RAM, CPU, or storage. Sharding lets you go past that wall by adding more boxes, each handling a fraction of the data and load.

How it works

Pick a partition key (user_id, account_id, geo region) and a partition strategy: hash-based (key % N or consistent hashing, for even distribution), range-based (alphabetical, date ranges, giving locality but risk of hot ranges), or directory-based (a lookup table, flexible but adds a hop). All reads/writes route to the right shard.

Why use it
What it costs you
Where it shows up in our architectures
Gotchas
When this went wrong in production

AWS S3 us-east-1 melts the internet · 2017

Postmortem ↗

One typo in a routine S3 maintenance command took down half the internet for 4 hours.

An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. S3 lost the index → every read started failing. Cascading failure: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.

Your notes

Private to you