Google Docs (Realtime Collab): System Design

Requirements & API: Google Docs (Realtime Collab)

What an interviewer expects you to nail down before drawing a single box.

Functional

•Let many users edit the same document simultaneously and converge on one consistent result.
•Apply each user's keystrokes locally with instant feedback, then reconcile with the server.
•Show live presence: collaborator cursors and colored selections in real time.
•Persist full edit history for undo / time-travel and bootstrap new clients from a snapshot.

Non-functional

•Strong eventual consistency: concurrent edits to the same region must merge deterministically via OT or a CRDT, so every replica converges on the same document without losing or corrupting edits.
•Sub-100ms edit propagation between collaborators; local keystrokes feel instant.
•Durability of every op. An OT server crash must not lose acknowledged edits.
•Survive flaky networks: clients reconnect and replay missed ops without conflict.

API contract

WS connect /docs/{doc_id} → stream of {op, rev}

Long-lived WebSocket; ops flow both directions over it.

send_op { doc_id, base_rev, op } → { transformed_op, new_rev }

Server transforms against concurrent ops, then broadcasts to others.
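
To make the role of base_rev concrete, here is a minimal sketch of what the server side of send_op might look like. It is not Google's implementation: it assumes insert-only ops, a hypothetical client id used to break position ties, and an in-memory log where the op at index i produced revision i + 1. The names Insert, DocServer and transform are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insert:
    pos: int      # index in the document as of the sender's base_rev
    text: str
    client: str   # hypothetical client id, used only to break ties


def transform(op: Insert, applied: Insert) -> Insert:
    """Shift `op` so it targets the same spot after `applied` has landed."""
    if (applied.pos, applied.client) < (op.pos, op.client):
        return Insert(op.pos + len(applied.text), op.text, op.client)
    return op


@dataclass
class DocServer:
    text: str = ""
    log: list = field(default_factory=list)  # log[i] produced rev i + 1

    @property
    def rev(self) -> int:
        return len(self.log)

    def send_op(self, base_rev: int, op: Insert) -> dict:
        if not 0 <= base_rev <= self.rev:
            raise ValueError(f"unknown base_rev {base_rev}")
        # Every op committed after base_rev is concurrent with this one.
        for applied in self.log[base_rev:]:
            op = transform(op, applied)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.log.append(op)
        # A real server would now broadcast the transformed op to the others.
        return {"transformed_op": op, "new_rev": self.rev}


server = DocServer(text="hello world")
first = server.send_op(0, Insert(5, ",", "alice"))
# Bob also edited against rev 0, so his position is shifted past Alice's comma.
second = server.send_op(0, Insert(11, "!", "bob"))
```

After both calls the server text is "hello, world!" and Bob's op comes back at position 12 with new_rev 2, which is the { transformed_op, new_rev } shape in the contract above.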

GET /api/v1/docs/{doc_id} → { snapshot, rev, acl }

Bootstraps a client from the latest snapshot plus the ops applied since that snapshot was taken.

send_presence { doc_id, cursor, selection }

Ephemeral; routed via the presence service, not the op log.
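
One way to picture "ephemeral" is a small in-memory store where each presence update expires unless it is refreshed. The sketch below is an assumption about how such a service could behave, not a description of the real one: the PresenceStore class, the user_id argument, and the 5-second TTL are all hypothetical.

```python
import time


class PresenceStore:
    """In-memory presence with a TTL; nothing here touches the op log."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # (doc_id, user_id) -> (expires_at, state)

    def send_presence(self, doc_id, user_id, cursor, selection=None):
        state = {"cursor": cursor, "selection": selection}
        self._entries[(doc_id, user_id)] = (self._clock() + self._ttl, state)

    def active(self, doc_id):
        """Return {user_id: state} for collaborators seen within the TTL."""
        now = self._clock()
        # Forget anyone whose last update is older than the TTL.
        self._entries = {
            key: entry for key, entry in self._entries.items() if entry[0] > now
        }
        return {
            user: state
            for (doc, user), (_, state) in self._entries.items()
            if doc == doc_id
        }
```

Because stale entries simply age out, a crashed or disconnected client needs no cleanup message, and losing the whole store costs only a moment of missing cursors.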

About Google Docs (Realtime Collab)

Five people open the same Google Doc and all start typing at once, and somehow nobody's words get lost or scrambled. Every keystroke shows up instantly on your own screen and a fraction of a second later on everyone else's. Merging concurrent edits into one consistent document, while still making typing feel instant, is the entire challenge here.

Here is the flow in plain terms. Your browser holds the editor and an OT engine, so each keystroke is applied locally right away and then sent over a long-lived WebSocket. A WebSocket Gateway holds that connection and forwards your edit to the OT Server that owns this document. That server keeps the document's canonical sequence of operations in memory, transforms your edit against any concurrent edits, broadcasts the result to everyone else, and appends it to a durable op log.
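
The gateway's job of finding "the OT Server that owns this document" can be sketched as a deterministic mapping from doc_id to a server. The version below is only an illustration under simple assumptions: a fixed list of hypothetical server names and plain hash-modulo placement. A production system would more likely use consistent hashing or a lookup table so that adding a server does not move most documents.

```python
import hashlib
from typing import Sequence


def owner_for(doc_id: str, servers: Sequence[str]) -> str:
    """Pick the one server that owns doc_id; same input, same answer."""
    if not servers:
        raise ValueError("no OT servers available")
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical server names, for illustration only.
OT_SERVERS = ["ot-0", "ot-1", "ot-2"]
target = owner_for("doc-123", OT_SERVERS)
```

Every gateway that runs this function with the same server list sends all edits for a document to the same place, which is what lets that server keep a single ordered sequence of ops.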

The subtle idea is Operational Transform itself. Say you type 'X' at position 10 at the same moment a colleague types 'Y' at position 10. Both edits reference 'position 10', but those positions mean different things once the other edit lands. It's like two people giving directions from 'the third house on the left' after a new house has been built on the street: the count has shifted. OT rewrites the second edit so it still points at the right spot, and the document converges instead of corrupting.
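
The 'X' and 'Y' at position 10 case can be worked through in a few lines. This is a minimal sketch, assuming insert-only edits and a hypothetical site id that decides who goes first when two inserts hit the same position; real OT engines also handle deletes and formatting.

```python
from typing import NamedTuple


class Insert(NamedTuple):
    pos: int
    text: str
    site: str  # hypothetical site id; the lower id wins position ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has been applied."""
    if (other.pos, other.site) < (op.pos, op.site):
        return op._replace(pos=op.pos + len(other.text))
    return op


doc = "0123456789abcdef"
x = Insert(10, "X", "alice")
y = Insert(10, "Y", "bob")

# Alice sees her own edit first, then Bob's edit transformed against hers.
alice_view = apply(apply(doc, x), transform(y, x))
# Bob sees his own edit first, then Alice's edit transformed against his.
bob_view = apply(apply(doc, y), transform(x, y))
```

Both views end up as "0123456789XYabcdef". Bob's 'Y' was shifted to position 11 on Alice's side, while Alice's 'X' stayed at 10 on Bob's side, so the two replicas applied the edits in different orders and still converged.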

Notice how durability and load are handled. The op log is the real document: an append-only history that lets the OT server rebuild its in-memory state after a crash and that powers undo and time-travel. To avoid replaying every edit ever made when someone opens a long doc, the system saves a snapshot every N ops and replays only the ops recorded since the latest snapshot, the same checkpoint-plus-log pattern databases use with a write-ahead log. Presence (the colored cursors showing who's editing) runs as its own service so it doesn't clutter the op stream.
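
The snapshot-plus-log idea is easy to see in code. The sketch below is an assumption-heavy toy: ops are insert-only (pos, text) pairs, everything lives in memory, and N is set to 3 just to keep the example short. The OpLog class and its method names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class OpLog:
    snapshot_every: int = 3                       # the "N"; value is arbitrary
    ops: list = field(default_factory=list)       # ops[i] produces rev i + 1
    snapshots: dict = field(default_factory=lambda: {0: ""})  # rev -> text

    def append(self, pos: int, text: str) -> int:
        """Record an op and take a snapshot at every N-th revision."""
        self.ops.append((pos, text))
        rev = len(self.ops)
        if rev % self.snapshot_every == 0:
            doc, _, _ = self.load()
            self.snapshots[rev] = doc
        return rev

    def load(self) -> tuple:
        """Return (text, rev, ops_replayed), starting at the newest snapshot."""
        base_rev = max(self.snapshots)
        doc = self.snapshots[base_rev]
        tail = self.ops[base_rev:]                # only ops after the snapshot
        for pos, text in tail:
            doc = doc[:pos] + text + doc[pos:]
        return doc, len(self.ops), len(tail)


log = OpLog(snapshot_every=3)
for i, ch in enumerate("abcde"):
    log.append(i, ch)
text, rev, replayed = log.load()
```

Here the document is at rev 5, but loading it replays only the 2 ops after the rev 3 snapshot. The full log is still kept, so older revisions remain reachable for undo and time-travel.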

This system teaches operational transforms and CRDTs as two answers to merging concurrent edits, why each document is sticky-routed to a single OT server for a totally-ordered op log, the snapshot-plus-delta pattern for fast loads with full history, and why the most-collaborated documents are the architectural hot spot.