What an interviewer expects you to nail down before drawing a single box.
WS send: { channel_id, client_msg_id, body } → ack { msg_id, ts }
Pub/Sub publish: channel:{id} → { msg_id, sender, body, ts }
GET /api/v1/channels/{id}/messages?since={cursor} → { messages[], next_cursor }
Picture a busy Slack channel with a few hundred people in it. Someone types a message, and it has to show up on everyone's screen at once. That fan-out is the whole problem. A chat app is real-time messaging with a twist: instead of sending one message to one person, you send it to everyone in a room, and those people are connected to many different servers.
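To make the contract above concrete, here is a minimal in-memory sketch of the send/ack and history-fetch shapes. The `ChannelLog` class and its method names are hypothetical. Two behaviours are assumptions the contract does not spell out: a retried send with the same `client_msg_id` gets the original ack back, and the `since` cursor is the last `msg_id` the client has seen.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Message:
    msg_id: int
    channel_id: str
    sender: str
    body: str
    ts: float


@dataclass
class ChannelLog:
    """In-memory stand-in for one channel's message store (hypothetical)."""
    channel_id: str
    messages: list = field(default_factory=list)
    acked: dict = field(default_factory=dict)  # (sender, client_msg_id) -> Message
    _ids: count = field(default_factory=lambda: count(1))

    def send(self, sender, client_msg_id, body, ts):
        """Handle a WS send and return the ack { msg_id, ts }."""
        key = (sender, client_msg_id)
        # Assumption: client_msg_id exists so that a retry is not stored twice.
        if key not in self.acked:
            msg = Message(next(self._ids), self.channel_id, sender, body, ts)
            self.messages.append(msg)
            self.acked[key] = msg
        msg = self.acked[key]
        return {"msg_id": msg.msg_id, "ts": msg.ts}

    def fetch(self, since=0, limit=50):
        """Handle GET .../messages?since={cursor}."""
        # Assumption: the cursor is the highest msg_id the client already has.
        page = [m for m in self.messages if m.msg_id > since][:limit]
        next_cursor = page[-1].msg_id if page else since
        return {"messages": page, "next_cursor": next_cursor}
```

A client that reconnects after a network drop would call `fetch` with its last cursor to fill the gap, then resume listening on the WebSocket.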
The surprising part is what the system is sized around. It is not message volume; most workplaces don't chat that much. It is connection volume. Every online user holds an open WebSocket the entire time the app is running, so you are building for millions of live connections, not millions of messages per second.
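A quick back-of-envelope calculation shows the gap between the two numbers. Every figure below is an illustrative assumption, not a measurement: 5 million concurrent users, 50,000 connections per gateway, 30% headroom, and 40 messages sent per user per day.

```python
def gateways_needed(concurrent_users, conns_per_gateway, headroom_pct=30):
    """Gateways required if each one keeps headroom_pct of its capacity free."""
    usable = conns_per_gateway * (100 - headroom_pct) // 100
    return -(-concurrent_users // usable)  # ceiling division on integers


def messages_per_second(users, msgs_per_user_per_day):
    """Average inbound send rate, assuming traffic is spread evenly over a day."""
    return users * msgs_per_user_per_day / 86_400


# Assumed workload: 5M open WebSockets, 40 sends per user per day.
print(gateways_needed(5_000_000, 50_000))         # 143
print(round(messages_per_second(5_000_000, 40)))  # 2315
```

Under these assumptions the fleet is sized by the 5 million open sockets, while the average inbound rate is only a few thousand messages per second. Peak traffic and per-member fan-out deliveries will be higher than that average, but the connection count still dominates.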
Here is how a message travels. You post to a channel. The chat service needs to reach every member's connection, but those connections are scattered across many gateway servers. So it publishes the message to the channel's topic on a pub/sub layer (Redis pub/sub or a message bus). Think of it like a radio station: the chat service broadcasts once, and every gateway tuned to that channel hears the message and forwards it to the channel members connected to that gateway. Presence (who's online, who's typing) and read receipts ride along the same path.
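The path described above can be sketched in a few lines. This is an in-process toy, not a real deployment: `PubSub` stands in for Redis pub/sub or a message bus, `Gateway` stands in for a WebSocket server, and a plain list plays the role of each user's socket. All class and method names are hypothetical.

```python
from collections import defaultdict


class PubSub:
    """In-process stand-in for the pub/sub layer."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> [callback]

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver to every subscriber; return how many received it."""
        callbacks = list(self.subscribers[topic])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


class Gateway:
    """Holds user connections and subscribes once per channel it serves."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.local = defaultdict(dict)  # channel_id -> {user: inbox}

    def connect(self, user, channel_id):
        first_member = channel_id not in self.local
        inbox = []  # stands in for the user's WebSocket
        self.local[channel_id][user] = inbox
        if first_member:
            # One subscription per (gateway, channel), not one per user.
            self.bus.subscribe(
                f"channel:{channel_id}",
                lambda msg, cid=channel_id: self._deliver(cid, msg),
            )
        return inbox

    def _deliver(self, channel_id, message):
        for inbox in self.local[channel_id].values():
            inbox.append(message)


bus = PubSub()
gw_a, gw_b = Gateway("gw-a", bus), Gateway("gw-b", bus)
ana = gw_a.connect("ana", "42")
bo = gw_b.connect("bo", "42")
# The chat service publishes once; both gateways forward to their local users.
bus.publish("channel:42", {"msg_id": 1, "sender": "cy", "body": "hi", "ts": 1.0})
```

The chat service never needs to know which gateway holds which user. It publishes to `channel:42` once, and each gateway handles delivery to its own connections.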
One design choice pays off again and again: keep the connection tier separate from the chat logic. Then you can deploy or scale the chat service without dropping everyone's WebSocket. This walkthrough covers room-based pub/sub fan-out, the gateway tier, and why scaling for connections is a different problem from scaling for messages.