Split one big dataset across N smaller stores so each one stays manageable.
Plain English: when one database gets too big to fit on one machine, you slice the data into chunks (by user, by region, by some key) and put each chunk on its own machine. Each shard handles just its slice.
Horizontal partitioning of a single logical dataset across multiple physical stores. Each shard holds a slice of the keyspace; queries are routed to the right shard based on a partition key.
A single database eventually hits a wall on IOPS, RAM, CPU, or storage. Sharding lets you go past that wall by adding more boxes, each handling a fraction of the data and load.
Pick a partition key (user_id, account_id, geo region) and a partition strategy: hash-based (key % N or consistent hashing, for even distribution), range-based (alphabetical, date ranges, giving locality but risk of hot ranges), or directory-based (a lookup table, flexible but adds a hop). All reads/writes route to the right shard.
Posts sharded by user_id; timeline cache sharded by user_id too
Message store (Cassandra) partitioned by conversation_id
Geo index sharded by H3 cell; trips sharded by city
One typo in a routine S3 maintenance command took down half the internet for 4 hours.
An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. S3 lost the index → every read started failing. Cascading failure: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.