API Rate Limiter: System Design

Requirements & API: API Rate Limiter

What an interviewer expects you to nail down before drawing a single box.

Functional

• For each request, decide allow/deny against the caller's limit (keyed by user_id or IP) before any real work happens.
• Support per-customer limits configurable at runtime (free: 100 req/min, enterprise: 10k req/min) without a deploy.
• Return a standard 429 with a Retry-After header so well-behaved clients back off.
• Apply the same algorithm (token bucket or sliding window) on every edge node.
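
As an illustration of the token-bucket option named above, here is a minimal single-process sketch. The class name, field names, and the `new_bucket` helper are assumptions made for this example; a real deployment would keep this state in the shared counter store rather than in local memory.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size, in tokens
    refill_per_sec: float  # steady-state rate, in tokens per second
    tokens: float          # tokens currently available
    last_refill: float     # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def new_bucket(limit: int, window_sec: float, now: float) -> TokenBucket:
    # A limit of `limit` requests per `window_sec` becomes a full bucket of
    # `limit` tokens that refills at limit / window_sec tokens per second.
    return TokenBucket(capacity=limit, refill_per_sec=limit / window_sec,
                       tokens=limit, last_refill=now)
```

A token bucket tolerates a burst up to the bucket's capacity and then settles to the refill rate, whereas a sliding window bounds the count over any trailing interval.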

Non-functional

• Sub-millisecond overhead per request. The limiter sits in front of every call and must be near-free.
• Accurate under bursts: counters must be atomic so concurrent requests can't both squeeze past the limit.
• Globally consistent counts across edge nodes, with no per-box-only view that lets a client get N× their limit.
• Define fail-open vs fail-closed: if the counter store is down, decide deliberately whether to allow or block.
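
The fail-open versus fail-closed requirement can be made explicit in code as a single policy flag. The sketch below assumes a hypothetical `StoreUnavailable` exception raised by whatever client talks to the counter store; neither that name nor `guarded_check` comes from a real library.

```python
from typing import Callable


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) counter-store client when it cannot be reached."""


def guarded_check(check: Callable[[str], bool], key: str, fail_open: bool = True) -> bool:
    """Run the limiter check; if the store is down, fall back to the chosen policy."""
    try:
        return check(key)
    except StoreUnavailable:
        # The store is down, so the policy decides instead of the counter:
        # fail_open=True allows the request, fail_open=False denies it.
        return fail_open
```

Keeping the choice in one parameter makes it a deliberate, reviewable decision rather than an accident of how an exception happens to propagate.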

API contract

internal: check(key, limit, window) → { allowed: bool, remaining, retry_after }

Called by the edge proxy per request; backed by an atomic Redis Lua script.
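
To make the shape of that contract concrete, here is an in-memory sketch of `check`. It is not the Redis Lua script: a dict plus a lock stand in for Redis, and for brevity it uses a simple fixed-window counter rather than the token bucket or sliding window discussed elsewhere. The class name and the explicit `now` parameter are assumptions added to keep the example testable.

```python
import math
import threading
from typing import Dict, Tuple


class InMemoryLimiter:
    """Fixed-window counter; a dict plus a lock stand in for Redis plus Lua."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: Dict[str, Tuple[int, int]] = {}  # key -> (window_start, count)

    def check(self, key: str, limit: int, window: int, now: float) -> dict:
        window_start = int(now // window) * window
        with self._lock:  # models the script running as one indivisible step
            start, count = self._counters.get(key, (window_start, 0))
            if start != window_start:  # a new window began: reset the count
                count = 0
            count += 1  # denied requests are counted too, as with a plain INCR
            self._counters[key] = (window_start, count)
        allowed = count <= limit
        return {
            "allowed": allowed,
            "remaining": max(0, limit - count),
            # Seconds until the current window ends, rounded up; 0 when allowed.
            "retry_after": 0 if allowed else math.ceil(window_start + window - now),
        }
```

An edge proxy would call something like `limiter.check("user:42", limit=100, window=60, now=time.time())` and forward the request only when `allowed` is true.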

PUT /admin/limits { customer_id, limit, window } → 200

Updates a customer's tier at runtime; invalidates the cached rule.
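
One way to picture "update at runtime, then invalidate the cached rule" is a small rule store with a read-through cache. Everything here is illustrative: `RuleStore`, `Rule`, the plain dict standing in for a durable config store, the free-tier default, and the 400 for invalid input are assumptions, not part of the contract above.

```python
from typing import Dict, NamedTuple


class Rule(NamedTuple):
    limit: int   # requests allowed per window
    window: int  # window length in seconds


DEFAULT_RULE = Rule(limit=100, window=60)  # free tier from the requirements


class RuleStore:
    def __init__(self) -> None:
        self._source: Dict[str, Rule] = {}  # stand-in for the durable config store
        self._cache: Dict[str, Rule] = {}   # per-node cache read on the hot path

    def put_limit(self, customer_id: str, limit: int, window: int) -> int:
        """Handler body for PUT /admin/limits; returns an HTTP status code."""
        if limit <= 0 or window <= 0:
            return 400  # assumption: reject nonsensical rules
        self._source[customer_id] = Rule(limit, window)
        self._cache.pop(customer_id, None)  # invalidate so the next lookup reloads
        return 200

    def lookup(self, customer_id: str) -> Rule:
        rule = self._cache.get(customer_id)
        if rule is None:  # cache miss: read through to the source of truth
            rule = self._source.get(customer_id, DEFAULT_RULE)
            self._cache[customer_id] = rule
        return rule
```

With several edge nodes, the invalidation would have to reach every node's cache (for example via a short TTL or a broadcast), which this single-process sketch does not attempt.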

429 Too Many Requests + Retry-After: <seconds>

The deny response contract the client is expected to honor.
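
A framework-neutral sketch of building that deny response follows. The function name, the tuple return shape, and the plain-text body are assumptions; the one detail taken from the HTTP specification is that Retry-After, when given as a delay, is a whole number of seconds.

```python
import math
from typing import Dict, Tuple


def deny_response(retry_after_sec: float) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a rate-limited request."""
    # Round up so a client that waits exactly this long never retries early,
    # and never advertise less than one second.
    seconds = max(1, math.ceil(retry_after_sec))
    headers = {"Retry-After": str(seconds), "Content-Type": "text/plain"}
    return 429, headers, "Too Many Requests\n"
```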

About API Rate Limiter

You call an API a little too fast and suddenly get back a 429 Too Many Requests with a note telling you to wait a few seconds. That polite rejection is a rate limiter doing its job: protecting the service behind it from a single client, buggy script, or scraper that would otherwise crowd out everyone else. It's a small system, and that is exactly why it shows up so often in interviews: in very little space it forces you to reason about atomicity, distributed state, and what to do when your own dependency fails.

Here is the whole thing in plain steps. Every request hits an edge proxy like Envoy or Nginx first. The proxy asks the Rate Limit Service whether this caller is allowed, passing a key such as the user_id or IP. The service looks up that customer's limit (free tier 100 req/min, enterprise 10k req/min), then atomically increments the caller's counter in a Redis cluster using a Lua script. If the new count is within the limit, the proxy forwards the request to the origin service; if not, it returns 429 with a Retry-After header and the request never reaches your real API.

The reason the increment must be atomic is best seen with a turnstile. Picture two people trying to squeeze through a one-person gate at the exact same instant. If the gate checks 'is it free?' and only then locks, both can slip through together. A plain GET-then-SET has the same flaw: two requests both read the old count, both write the same new count, and the limit is silently breached. A Lua script runs as one indivisible step inside Redis, so nothing can sneak between the read and the write.
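
The turnstile problem can be replayed deterministically. In the sketch below, `racy_interleaving` steps through the bad GET-then-SET ordering by hand, while `atomic_admitted` runs real threads against a locked increment, which plays the role of the Lua script. All names are illustrative.

```python
import threading
from typing import List


class Counter:
    def __init__(self, value: int = 0) -> None:
        self.value = value
        self._lock = threading.Lock()

    def get(self) -> int:
        return self.value

    def set(self, v: int) -> None:
        self.value = v

    def incr(self) -> int:
        # Read-modify-write under a lock: the analogue of the atomic script.
        with self._lock:
            self.value += 1
            return self.value


def racy_interleaving(limit: int) -> int:
    """Replay the bad ordering: both requests read before either one writes."""
    c = Counter(limit - 1)   # exactly one slot left
    seen_a = c.get()         # request A reads the count
    seen_b = c.get()         # request B reads the same stale count
    admitted = 0
    for seen in (seen_a, seen_b):
        if seen < limit:     # each request believes the slot is free
            c.set(seen + 1)  # and both write the same new value
            admitted += 1
    return admitted


def atomic_admitted(limit: int, requests: int) -> int:
    """Fire `requests` concurrent increments; count how many stay within the limit."""
    c = Counter()
    results: List[bool] = []

    def worker() -> None:
        results.append(c.incr() <= limit)

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

With one slot left, the replayed interleaving admits both requests, while the atomic increment hands every caller a distinct count and so never admits more than the limit.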

Two design decisions define this system. First, counters are spread across the Redis cluster with consistent hashing, so adding or removing a node reshuffles only about 1/N of the keys instead of emptying every counter at once. Second, you must decide ahead of time what happens when Redis itself goes down: fail open and allow everything, or fail closed and deny everything. Most systems fail open, because a brief abuse window is recoverable while rejecting every request is a self-inflicted outage, though cost-critical APIs sometimes choose the opposite. This system teaches atomic distributed counters, consistent hashing for low-churn rebalancing, token-bucket and sliding-window algorithms, and the deliberate fail-open versus fail-closed tradeoff.
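
The "about 1/N of the keys" claim can be checked with a small hash ring. This is a generic sketch of client-side consistent hashing with virtual nodes, not Redis Cluster's own key-to-slot mapping; the node names, the choice of SHA-256, and the 100 virtual nodes per server are assumptions.

```python
import bisect
import hashlib
from typing import List


def _h(s: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        # Each node owns `vnodes` points on the ring to even out its share.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect_right(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]


def moved_fraction(before: HashRing, after: HashRing, keys: List[str]) -> float:
    """Fraction of keys whose owning node differs between the two rings."""
    moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
    return moved / len(keys)
```

Growing a ring from four nodes to five should move roughly one fifth of the keys, and every key that moves lands on the new node; with naive `hash(key) % N` placement, most keys would change owner.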