Amazon S3 (Object Storage): System Design

Requirements & API: Amazon S3 (Object Storage)

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

•PUT, GET, DELETE, and LIST objects addressed by (bucket, key) over an HTTPS API.
•Provide read-after-write consistency: a GET issued right after a successful PUT returns the new object.
•Support object versioning, storage classes (Standard, One Zone-IA, Glacier), and presigned URLs.
•Return object metadata (size, content-type, ETag) cheaply on HEAD without fetching the bytes.

Non-functional requirements

•Durability of 11 nines (99.999999999%). Losing a customer's object is unacceptable.
•Tolerate the loss of an entire Availability Zone with no data loss and no read interruption.
•Effectively unbounded scale: exabytes of data and trillions of objects.
•Keep storage overhead in check: three-way replication costs 3x raw storage, while erasure coding (for example six data plus three parity fragments) reaches comparable durability at ~1.5x.
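
The overhead figures in the last bullet are simple arithmetic: raw bytes stored divided by user bytes. A minimal sketch, assuming a scheme with some number of data units plus redundant units; the function name storage_overhead is illustrative, not part of any S3 API.

```python
from fractions import Fraction

def storage_overhead(data_units: int, redundant_units: int) -> Fraction:
    """Raw bytes stored per byte of user data."""
    return Fraction(data_units + redundant_units, data_units)

# Three-way replication: one copy of the data plus two extra full copies.
replication = storage_overhead(1, 2)
# 6+3 erasure coding: six data fragments plus three parity fragments.
erasure = storage_overhead(6, 3)

print(float(replication))  # 3.0
print(float(erasure))      # 1.5
```

In this model, replication survives the loss of two of its three copies, while the 6+3 layout survives the loss of any three of its nine fragments at half the raw storage.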

API contract

PUT /{bucket}/{key} (body) → { ETag, version_id }

Acked only after enough cross-AZ copies (or fragments) are durably written and the metadata pointer has then been committed.

GET /{bucket}/{key} → object bytes

Reads from an in-sync replica, or reconstructs from parity if a fragment is missing.

HEAD /{bucket}/{key} → { size, content_type, ETag }

Metadata only. Served from the metadata service without ever touching data nodes.

GET /{bucket}?list-type=2&prefix=... → { keys[], next_token }

Paginated LIST against the sharded metadata service.
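
To make the paginated LIST concrete, here is a minimal sketch over a single sorted, in-memory key list. It assumes keys are kept in lexicographic order (so all keys sharing a prefix are contiguous) and uses the last returned key as the continuation token. The name list_keys and the token format are assumptions for illustration; a real service would shard the index and treat the token as opaque.

```python
import bisect

def list_keys(sorted_keys, prefix="", token=None, max_keys=1000):
    """Return (keys, next_token) for one page of a prefix listing.

    sorted_keys must be sorted lexicographically.
    token is the last key of the previous page, or None for the first page.
    """
    if token is None:
        i = bisect.bisect_left(sorted_keys, prefix)   # first key >= prefix
    else:
        i = bisect.bisect_right(sorted_keys, token)   # first key after the token
    page = []
    while i < len(sorted_keys) and len(page) < max_keys:
        key = sorted_keys[i]
        if not key.startswith(prefix):
            break                                     # left the prefix range
        page.append(key)
        i += 1
    # There is another page only if the next key still matches the prefix.
    more = bool(page) and i < len(sorted_keys) and sorted_keys[i].startswith(prefix)
    return page, (page[-1] if more else None)

keys = sorted(["photos/a.jpg", "photos/b.jpg", "photos/c.jpg", "videos/x.mp4"])
page, token = list_keys(keys, prefix="photos/", max_keys=2)
print(page, token)   # ['photos/a.jpg', 'photos/b.jpg'] photos/b.jpg
page, token = list_keys(keys, prefix="photos/", token=token, max_keys=2)
print(page, token)   # ['photos/c.jpg'] None
```

Using the last key as the cursor keeps the server stateless between pages: each request is an independent range scan that starts just past the token.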

About Amazon S3 (Object Storage)

Think about uploading a photo and trusting it will still be there, intact, years later. That is object storage, and S3 is the canonical example. You PUT an object under a (bucket, key) name over a plain HTTPS API and GET it back later. The promise sounds simple, but the bar is brutal: eleven nines of durability. Read as an annual loss rate of about 1 in 10^11 per object, that means storing ten million objects and expecting to lose roughly one every ten thousand years.

The system splits into two planes. A small metadata service maps each (bucket, key) to where the object's bytes actually live, and a much larger data plane holds the bytes. A request first passes identity and access checks, then a stateless API service coordinates the metadata service and the data store. The trick behind read-after-write consistency is ordering: the bytes are written and replicated first, and the metadata pointer is committed last, so a key never points at a half-written object.
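
The write ordering described above can be sketched with two in-memory stores. This is a toy model under stated assumptions: DataStore, MetadataStore, put_object, and get_object are hypothetical names, replication is omitted, and a SHA-256 digest stands in for the ETag.

```python
import hashlib
import uuid

class DataStore:
    """Stand-in for the data plane: immutable blobs addressed by a generated id."""
    def __init__(self):
        self.blobs = {}

    def write(self, data: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self.blobs[blob_id] = data
        return blob_id

    def read(self, blob_id: str) -> bytes:
        return self.blobs[blob_id]

class MetadataStore:
    """Stand-in for the metadata service: (bucket, key) -> pointer record."""
    def __init__(self):
        self.pointers = {}

    def commit(self, bucket, key, record):
        self.pointers[(bucket, key)] = record

    def lookup(self, bucket, key):
        return self.pointers.get((bucket, key))

def put_object(meta, data, bucket, key, body: bytes) -> str:
    blob_id = data.write(body)                 # step 1: bytes land first
    etag = hashlib.sha256(body).hexdigest()    # assumption: digest used as ETag
    meta.commit(bucket, key, {"blob_id": blob_id, "size": len(body), "etag": etag})
    return etag                                # step 2 done: pointer committed last

def get_object(meta, data, bucket, key) -> bytes:
    record = meta.lookup(bucket, key)
    if record is None:
        raise KeyError(f"{bucket}/{key} not found")
    return data.read(record["blob_id"])

meta, data = MetadataStore(), DataStore()
put_object(meta, data, "photos", "cat.jpg", b"v1")
data.write(b"v2 upload that crashed before commit")   # orphan blob, no pointer
print(get_object(meta, data, "photos", "cat.jpg"))    # b'v1'
```

Because each PUT writes a fresh blob and only then swaps the pointer, a crash between the two steps leaves an unreferenced blob to garbage-collect, never a key that resolves to partial data.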

How are the bytes kept safe? The straightforward design, and the one we walk through first, is replication. A placement service picks a primary data node, the object is written there, and the primary copies it to two secondary nodes in other Availability Zones. Three copies in three independent failure domains. The placement service watches every node's heartbeats, and if the primary dies it promotes a secondary and rebuilds the missing copy.
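
A minimal sketch of the placement and failover logic just described, assuming heartbeats carry an explicit timestamp and that the first entry of a replica list is the primary. The class and method names (PlacementService, place, repair) and the 10-second timeout are illustrative assumptions; copying the bytes to the replacement node is left out.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    az: str
    last_heartbeat: float = float("-inf")   # never heard from = not alive

class PlacementService:
    def __init__(self, nodes, heartbeat_timeout=10.0):
        self.nodes = {n.name: n for n in nodes}
        self.timeout = heartbeat_timeout

    def heartbeat(self, name, now):
        self.nodes[name].last_heartbeat = now

    def is_alive(self, name, now):
        return now - self.nodes[name].last_heartbeat <= self.timeout

    def place(self, now, copies=3, exclude_azs=()):
        """Pick `copies` live nodes, each in a distinct AZ; the first is the primary."""
        chosen, used = [], set(exclude_azs)
        for node in self.nodes.values():
            if len(chosen) == copies:
                break
            if self.is_alive(node.name, now) and node.az not in used:
                chosen.append(node.name)
                used.add(node.az)
        if len(chosen) < copies:
            raise RuntimeError("not enough live nodes in distinct AZs")
        return chosen

    def repair(self, replicas, now):
        """Drop dead replicas, promote the first survivor, and pick replacements."""
        survivors = [r for r in replicas if self.is_alive(r, now)]
        if not survivors:
            raise RuntimeError("all replicas lost")
        missing = len(replicas) - len(survivors)
        if missing:
            used = {self.nodes[r].az for r in survivors}
            survivors += self.place(now, copies=missing, exclude_azs=used)
        return survivors   # survivors[0] is the new primary

svc = PlacementService([Node("a1", "az-a"), Node("a2", "az-a"),
                        Node("b1", "az-b"), Node("c1", "az-c")])
for name in ("a1", "a2", "b1", "c1"):
    svc.heartbeat(name, now=0)
replicas = svc.place(now=1)
print(replicas)                          # ['a1', 'b1', 'c1']
for name in ("a2", "b1", "c1"):          # a1 goes silent
    svc.heartbeat(name, now=20)
print(svc.repair(replicas, now=25))      # ['b1', 'c1', 'a2']
```

In the example the primary a1 stops sending heartbeats, so b1 is promoted and a2 is chosen as the replacement because it is the only live node outside the AZs the survivors already occupy.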

Replication is simple but stores 3x the data. At exabyte scale that is expensive, so the common alternative (and what real S3 uses) is erasure coding. An object is split into data and parity fragments, for example six data plus three parity, spread across AZs. Think of it as RAID for a whole data center: any six of the nine fragments rebuild the original. You get comparable durability at roughly 1.5x storage (nine fragments stored for six fragments' worth of data) instead of 3x. If an AZ goes dark, GETs read the surviving fragments and reconstruct on the fly; with three fragments in each of three AZs, for instance, losing one AZ still leaves the six needed.
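
The "any six of nine" property can be demonstrated with a toy Reed-Solomon-style code. This sketch is an assumption-laden illustration, not S3's actual encoding: it works over the small prime field GF(257) rather than the GF(2^8) arithmetic production systems typically use, each fragment holds a single symbol instead of a stripe of bytes, and a parity symbol may equal 256 and so not fit in one byte. The function names encode and reconstruct are hypothetical.

```python
P = 257  # small prime field; every byte value 0..255 is a valid symbol

def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # modular inverse, Python 3.8+
    return total

def encode(data, parity_count):
    """Return the k data symbols followed by parity_count parity symbols.

    Fragment i is the value at x = i of the polynomial passing through
    (0, data[0]) ... (k-1, data[k-1]), so the data fragments are stored as-is.
    """
    k = len(data)
    points = list(enumerate(data))
    parity = [_interpolate(points, x) for x in range(k, k + parity_count)]
    return list(data) + parity

def reconstruct(survivors, k):
    """Rebuild the k data symbols from any k surviving {fragment_index: symbol} pairs."""
    if len(survivors) < k:
        raise ValueError("too few fragments to reconstruct")
    points = list(survivors.items())[:k]
    return [_interpolate(points, x) for x in range(k)]

data = [83, 51, 0, 255, 7, 42]            # six data symbols
fragments = encode(data, 3)               # nine fragments: 6 data + 3 parity
# Suppose fragments 0-2 lived in an AZ that went dark.
survivors = {i: fragments[i] for i in range(3, 9)}
print(reconstruct(survivors, 6) == data)  # True
```

Any six fragments pin down a degree-5 polynomial uniquely, which is why it does not matter which three are lost; reads that find all six data fragments intact skip the interpolation entirely.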

This system teaches the metadata-versus-data split that recurs in GFS, HDFS, and Ceph, why committing metadata last buys strong consistency, and the economics of replication versus erasure coding across failure domains.