Notification System: System Design

Requirements & API: Notification System

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

• Accept a notification request from any internal producer (recipient, type, payload) and fan it out to the right channels.
• Respect per-user channel preferences (email yes, SMS no, push only during the day) for every send.
• Deliver across email, SMS, and push via per-channel workers calling external providers.
• Retry failed sends with backoff; handle bounces and unsubscribes per channel.
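
The "retry with backoff" requirement can be sketched as follows. This is a minimal illustration, not the system's actual worker code: `TransientSendError`, `send_with_retry`, the attempt limit, and the delay values are all assumptions chosen for the example, and the jitter strategy (a random wait up to the exponential bound) is one common choice among several.

```python
import random
import time
from typing import Callable


class TransientSendError(Exception):
    """Hypothetical error: the provider call failed but may succeed later."""


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bound on the wait before each retry: base * 2^n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def send_with_retry(
    send: Callable[[], None],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts used.

    Re-raises the last error once max_attempts is exhausted, so the caller
    can park the job (for example in a dead-letter queue).
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except TransientSendError:
            if attempt == max_attempts:
                raise
            # Jitter: wait a random time up to the exponential bound so that
            # many failing jobs do not all retry at the same instant.
            sleep(random.uniform(0, delays[attempt - 1]))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Permanent failures such as hard bounces or unsubscribes would not raise the transient error and so would not be retried.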

Non-functional requirements

• Asynchronous and decoupled: a slow SMS provider must never back up the email queue.
• At-least-once delivery with dedup downstream; missing a password-reset email is worse than a rare duplicate.
• Independent per-channel scaling, since channels have very different throughput and failure profiles.
• Producer-facing API must stay fast: enqueue and return, never block on actual delivery.

API contract

POST /v1/notifications { user_id, type, payload, channels? } → { notification_id, status: "queued" }

Enqueues the request and returns immediately. The user's preferences determine which channels are actually used.
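
A rough sketch of the enqueue-and-return behavior behind this endpoint, under stated assumptions: the `NotificationRequest` dataclass mirrors the request fields above, `post_notification` is a hypothetical handler name, and a plain in-memory `deque` stands in for the real queue so the example runs on its own.

```python
import uuid
from collections import deque
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class NotificationRequest:
    user_id: str
    type: str
    payload: dict[str, Any]
    channels: Optional[list[str]] = None  # optional field from the contract


# Stand-in for the real message queue; an assumption for this sketch only.
pending: deque = deque()


def post_notification(req: NotificationRequest) -> dict[str, str]:
    """Validate, enqueue, and return without calling any delivery provider."""
    if not req.user_id or not req.type:
        raise ValueError("user_id and type are required")
    notification_id = str(uuid.uuid4())
    pending.append((notification_id, req))
    return {"notification_id": notification_id, "status": "queued"}
```

The handler does no delivery work at all, which is what keeps the producer's latency independent of any provider.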

GET /v1/preferences/{user_id} → { email, sms, push, quiet_hours }

Read-heavy; cached in Redis.

PUT /v1/preferences/{user_id} { email, sms, push, quiet_hours } → 200

Invalidates the cached prefs.
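
The read and write paths for preferences might look like the cache-aside sketch below. It is illustrative only: `PreferenceStore` is a hypothetical class, two plain dicts stand in for the database and for Redis, and the all-channels-on default for unknown users is an assumption the source does not specify.

```python
from typing import Any

# Assumed default for users with no stored preferences (not from the source).
DEFAULT_PREFS: dict[str, Any] = {
    "email": True, "sms": True, "push": True, "quiet_hours": None,
}


class PreferenceStore:
    """Cache-aside reads, invalidate-on-write."""

    def __init__(self) -> None:
        self.db: dict[str, dict[str, Any]] = {}     # stand-in for the database
        self.cache: dict[str, dict[str, Any]] = {}  # stand-in for Redis
        self.db_reads = 0                           # counter for illustration

    def get(self, user_id: str) -> dict[str, Any]:
        """GET path: serve from cache, fall back to the database on a miss."""
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        prefs = dict(self.db.get(user_id, DEFAULT_PREFS))
        self.cache[user_id] = prefs
        return prefs

    def put(self, user_id: str, prefs: dict[str, Any]) -> None:
        """PUT path: write the database, then drop the cached copy."""
        self.db[user_id] = dict(prefs)
        self.cache.pop(user_id, None)
```

Deleting the cache entry on write, rather than updating it in place, means the next read repopulates it from the database. A real deployment would likely also set a TTL on cached entries.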

About Notification System

When your order ships, a phone buzzes. Maybe an email lands too, but no text, because you turned SMS off months ago. Behind that simple moment is a whole platform whose job is to take one event from some internal service and turn it into the right messages, on the right channels, for the right person. A notification system looks easy until you realize every team in the company wants to send something, and every user wants different rules.

Here is the whole thing in plain steps. An internal producer, say the order service, POSTs a request to the API gateway, which authenticates it and rate-limits per producer. The Notification Service then looks up that user's preferences (email on, push on, SMS off) and creates one job per enabled channel, dropping each into Kafka and returning 200 immediately. Per-channel workers consume their own topics: the email worker calls SendGrid, the SMS worker calls Twilio, the push worker calls APNs or FCM, each retrying with backoff when a provider fails.
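
The fan-out step in the middle of that flow can be sketched like this. The function name `fan_out`, the topic naming scheme `notifications.<channel>`, and the job fields are assumptions for illustration; a dict of lists stands in for Kafka topics.

```python
from collections import defaultdict
from typing import Any

CHANNELS = ("email", "sms", "push")


def fan_out(
    notification_id: str,
    user_id: str,
    payload: dict[str, Any],
    prefs: dict[str, Any],
    topics: dict[str, list],
) -> list[dict[str, Any]]:
    """Create one job per enabled channel and append it to that channel's topic."""
    jobs = []
    for channel in CHANNELS:
        if not prefs.get(channel, False):
            continue  # the user has this channel turned off
        job = {
            "job_id": f"{notification_id}:{channel}",
            "user_id": user_id,
            "channel": channel,
            "payload": payload,
        }
        topics[f"notifications.{channel}"].append(job)  # hypothetical topic name
        jobs.append(job)
    return jobs


# Example: email on, push on, SMS off produces two jobs and no SMS topic entry.
example_topics: dict[str, list] = defaultdict(list)
example_jobs = fan_out(
    "n1", "u1", {"order": 42},
    {"email": True, "sms": False, "push": True},
    example_topics,
)
```

Deriving `job_id` from the notification ID and the channel gives each job a stable identifier that a worker could later use for deduplication.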

The reason for separate queues per channel is best seen with a traffic analogy. Imagine email, SMS, and push all sharing one lane on a highway. The moment the SMS provider slows to a crawl, every email and push behind it is stuck too. Giving each channel its own lane means a backed-up SMS provider never delays a password-reset email. Each channel scales and fails on its own.
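
The highway analogy can be made concrete with a toy model. This is a deliberate simplification: jobs are `(channel, message)` tuples, `healthy` is a hypothetical map of which providers are currently responding, and a stalled provider is modeled as the consumer simply stopping at that job.

```python
from collections import deque


def drain_shared(queue: deque, healthy: dict[str, bool]) -> list[tuple[str, str]]:
    """One shared lane: the consumer stops at the first job whose provider is down."""
    delivered = []
    while queue and healthy[queue[0][0]]:
        delivered.append(queue.popleft())
    return delivered


def drain_per_channel(
    queues: dict[str, deque], healthy: dict[str, bool]
) -> list[tuple[str, str]]:
    """One lane per channel: a down provider only stalls its own queue."""
    delivered = []
    for channel, queue in queues.items():
        while queue and healthy[channel]:
            delivered.append(queue.popleft())
    return delivered


jobs = [("sms", "login code"), ("email", "password reset"), ("push", "order shipped")]
healthy = {"email": True, "sms": False, "push": True}

# Shared lane: the SMS job at the head blocks everything behind it.
shared_result = drain_shared(deque(jobs), healthy)

# Separate lanes: email and push go out while the SMS job waits.
lanes = {"email": deque(), "sms": deque(), "push": deque()}
for job in jobs:
    lanes[job[0]].append(job)
isolated_result = drain_per_channel(lanes, healthy)
```

With the SMS provider down, the shared queue delivers nothing, while the per-channel queues still deliver the email and the push and leave only the SMS job waiting.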

Two decisions carry the design. First, preferences live in one service, not in each producer, so a user's 'no SMS at night' rule is honored no matter which team triggered the message, and those read-heavy prefs get cached in Redis to keep the hot path fast. Second, the producer's call returns the instant the jobs are queued, so its latency is never tied to Apple's push servers. Delivery is at-least-once with idempotency keys, because a rare duplicate is far better than a missing password reset.

This system teaches asynchronous fan-out through a queue, per-channel isolation and independent scaling, centralized user preferences, and the build-vs-buy case for outsourcing delivery to providers like SendGrid and Twilio.
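
One way a channel worker might use idempotency keys to absorb redeliveries is sketched below. `DedupingSender` and the `idempotency_key` job field are hypothetical names, and an in-memory set stands in for whatever shared store (for example Redis with an expiry) a real worker would use.

```python
from typing import Any, Callable


class DedupingSender:
    """Skip jobs whose idempotency key has already been sent successfully."""

    def __init__(self, send: Callable[[dict[str, Any]], None]) -> None:
        self._send = send
        self._seen: set[str] = set()  # stand-in for a shared store

    def handle(self, job: dict[str, Any]) -> bool:
        """Return True if the job was sent, False if it was a duplicate."""
        key = job["idempotency_key"]
        if key in self._seen:
            return False
        self._send(job)
        # Record the key only after a successful send, so a failed send
        # stays eligible for retry (at-least-once, not at-most-once).
        self._seen.add(key)
        return True
```

Because the key is recorded after the send, a crash between the two steps can still produce a duplicate on redelivery. That is the trade the design accepts: a rare duplicate rather than a lost message.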