Requirements & API: Dropbox (File Sync)

What an interviewer expects you to nail down before drawing a single box.

Functional

  • Sync a watched folder across a user's devices, uploading only the chunks that changed.
  • Chunk files (~4MB), content-address them by hash, and store each unique chunk exactly once.
  • Maintain the per-user file tree, version history, and a chunk manifest per file.
  • Notify a user's other devices when a file changes so they pull the new version.

Non-functional

  • Block-level delta sync: a one-line, in-place edit to a 1GB file must ship about one ~4MB chunk, not 1GB.
  • Content-addressed dedup across all users. The same chunk uploaded by 1000 people is stored once.
  • Commit metadata durably before firing the change notification (event-after-commit ordering).
  • Strong consistency on metadata (atomic moves, permissions). The immutable chunk store needs no invalidation.

API contract

POST /api/v1/chunks/check { hashes[] } → { missing_hashes[] }
The client asks which chunks the server lacks and uploads only those. This is where dedup saves bandwidth.
PUT /api/v1/chunks/{sha256} → 200
Content-addressed write; a no-op if the chunk already exists from any user.
POST /api/v1/files { path, chunk_manifest[] } → { file_id, version }
Commits the manifest in MySQL, then publishes file.changed to wake other devices.
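
The three calls above can be sketched end to end. This is a minimal in-memory stand-in, not a real HTTP service: the names FakeSyncServer and upload are hypothetical, the 400 on a hash mismatch is an assumed behaviour, and so is rejecting a manifest that references chunks the server has never seen.

```python
import hashlib


class FakeSyncServer:
    """In-memory stand-in for the three endpoints (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}             # sha256 hex -> chunk bytes
        self.files: dict[str, dict] = {}               # path -> file_id, version, manifest
        self.events: list[tuple[str, str, int]] = []   # published (event, path, version)

    def check_chunks(self, hashes: list[str]) -> list[str]:
        """POST /api/v1/chunks/check: return the hashes the server lacks."""
        return [h for h in dict.fromkeys(hashes) if h not in self.chunks]

    def put_chunk(self, sha256: str, body: bytes) -> int:
        """PUT /api/v1/chunks/{sha256}: a no-op if the chunk already exists."""
        if hashlib.sha256(body).hexdigest() != sha256:
            return 400  # assumption: a body that does not match its key is rejected
        self.chunks.setdefault(sha256, body)
        return 200

    def commit_file(self, path: str, chunk_manifest: list[str]) -> dict:
        """POST /api/v1/files: commit the manifest, then publish file.changed."""
        missing = self.check_chunks(chunk_manifest)
        if missing:
            # assumption: a manifest may only reference chunks already stored
            raise ValueError(f"manifest references unknown chunks: {missing}")
        prev = self.files.get(path)
        record = {
            "file_id": prev["file_id"] if prev else f"f{len(self.files) + 1}",
            "version": prev["version"] + 1 if prev else 1,
            "manifest": list(chunk_manifest),
        }
        self.files[path] = record  # stands in for the MySQL commit
        self.events.append(("file.changed", path, record["version"]))
        return {"file_id": record["file_id"], "version": record["version"]}


def upload(server: FakeSyncServer, path: str, chunks: list[bytes]) -> dict:
    """Client side: check, upload only the missing chunks, then commit the manifest."""
    manifest = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_hash = dict(zip(manifest, chunks))
    for h in server.check_chunks(manifest):
        server.put_chunk(h, by_hash[h])
    return server.commit_file(path, manifest)
```

Calling upload twice for the same path with one chunk changed uploads only that one chunk the second time and bumps the version from 1 to 2.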

About Dropbox (File Sync)

You change one line in a 1GB log file in your Dropbox folder, and a few seconds later that change is on your phone and your work laptop. Dropbox didn't re-upload a gigabyte. It uploaded about 4MB. The whole design is built around moving as few bytes as possible while keeping every device in sync.

Here is how it works in plain steps. The desktop client watches your folder and splits each file into roughly 4MB chunks, computing a content hash for each one. Before uploading, it asks the server which of those hashes are missing and sends only those. The Upload Service writes each new chunk into an immutable object store (S3) keyed by its hash. The Metadata Service then records the file's chunk manifest in MySQL, and only after that commit does it publish a file.changed event that wakes your other devices to pull the new chunks.
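
The client-side half of those steps can be sketched as follows. It assumes fixed-size chunks and SHA-256 as the content hash (SHA-256 appears in the API path, but the exact chunking scheme is an assumption), and the function names chunk_hashes and hashes_to_upload are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # ~4MB, as in the text


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-256 hex digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def hashes_to_upload(manifest: list[str], server_has: set[str]) -> list[str]:
    """Hashes from the manifest that the server lacks, in order, without repeats."""
    return [h for h in dict.fromkeys(manifest) if h not in server_has]


# Tiny demo with 16-byte chunks so the effect is visible without a 1GB file.
old = bytes(range(64))
new = bytearray(old)
new[20] ^= 0xFF  # overwrite one byte in place; it falls inside chunk index 1

old_manifest = chunk_hashes(old, 16)
new_manifest = chunk_hashes(bytes(new), 16)
to_send = hashes_to_upload(new_manifest, set(old_manifest))  # only the chunk at index 1
```

At 4MB per chunk a 1GB file has about 256 chunks, so one changed chunk is roughly 0.4% of the file. The sketch assumes the edit overwrites bytes in place; with fixed-size chunks, inserting bytes would shift every later chunk boundary.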

The idea worth slowing down on is content-addressed storage: the chunk's own hash is its storage key. Think of a coat check where the ticket is computed from the coat itself. If a thousand people hand in the identical coat, they all get the same ticket and the cloakroom stores just one. The same Linux ISO uploaded by a thousand users is stored exactly once, and because chunks are immutable, the hash also doubles as an integrity check and you never have to invalidate a cache.
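
The coat-check idea fits in a few lines. This is a toy, in-memory sketch with a hypothetical ChunkStore class; a real deployment would sit on an object store such as S3, and SHA-256 is assumed as the hash.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # the ticket is computed from the coat
        self._blobs.setdefault(key, data)       # identical content lands in the same slot
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Re-hashing on read is the integrity check: stored bytes must match their key.
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError(f"chunk {key} failed its integrity check")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ChunkStore()
tickets = {store.put(b"identical coat") for _ in range(1000)}
# 1000 puts of the same bytes yield one ticket and one stored blob.
```

Because a key can only ever map to one byte sequence, a reader that has cached the bytes for a key never needs to check whether they went stale.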

Notice the two stores split by what they hold. Big immutable chunk bytes go to cheap, near-infinite S3, while small, frequently-updated metadata (the file tree, versions, permissions, manifests) lives in relational MySQL where joins and atomic moves are easy. The notification is fired only after the metadata commit, never before, so another device can't race to pull a version that doesn't exist yet.
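
The ordering rule can be sketched as below. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL and queue.Queue as a stand-in for the notification channel; the table layout and the function names are assumptions made for illustration.

```python
import json
import queue
import sqlite3

SCHEMA = """
CREATE TABLE file_versions (
    path     TEXT    NOT NULL,
    version  INTEGER NOT NULL,
    manifest TEXT    NOT NULL,
    PRIMARY KEY (path, version)
)
"""


def open_metadata_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def commit_then_notify(db: sqlite3.Connection, events: queue.Queue,
                       path: str, manifest: list[str]) -> int:
    """Commit the new version first; publish file.changed only if the commit succeeded."""
    with db:  # commits on success, rolls back if the block raises
        row = db.execute(
            "SELECT MAX(version) FROM file_versions WHERE path = ?", (path,)
        ).fetchone()
        version = (row[0] or 0) + 1
        db.execute(
            "INSERT INTO file_versions (path, version, manifest) VALUES (?, ?, ?)",
            (path, version, json.dumps(manifest)),
        )
    # Reached only after the transaction has committed.
    events.put({"event": "file.changed", "path": path, "version": version})
    return version


def pull_manifest(db: sqlite3.Connection, event: dict) -> list[str]:
    """What another device does on wake-up: read the version named in the event."""
    row = db.execute(
        "SELECT manifest FROM file_versions WHERE path = ? AND version = ?",
        (event["path"], event["version"]),
    ).fetchone()
    return json.loads(row[0])
```

Because the event is enqueued after the with block exits, any device that receives it can already read the version it names. The sketch does not handle a crash between the commit and the publish, which a production design would have to address separately.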

This system teaches four things: block-level delta sync; content-addressed storage, with dedup, integrity checking, and freedom from cache invalidation falling out of it for free; splitting blob storage from metadata by access pattern; and the event-after-commit ordering rule for any system that mixes a database write with a fan-out notification.