Deep DiveMay 23, 2026·9 min read

How Rate Limiting Actually Works at Scale

Token buckets, sliding windows, Redis counters, and the distributed rate limiting problem nobody talks about in interviews.

rate limitingalgorithmsredisdistributed systemsapi design

Rate limiting is one of those systems that sounds simple (allow N requests per second) until you start implementing it across a distributed fleet and realize that 'N requests per second' has at least four different interpretations, each with different implementation requirements and failure modes.

Algorithm 1: Token Bucket

The token bucket algorithm models a bucket that holds up to N tokens and refills at a rate of R tokens per second. Each request consumes one token. If the bucket is empty, the request is rejected or queued. The key property: bursting is allowed up to the bucket capacity. A client that has been idle for 10 seconds can send 10 requests instantly (if N=10), then must wait for the refill rate. This is the right model for APIs where occasional bursts should be accommodated. It reflects real user behavior. Cloudflare, AWS API Gateway, and Stripe's API all use token bucket variants.

Algorithm 2: Sliding Window Counter

Fixed window counting is simple (count requests in the current minute, reset at :00) but has a boundary problem: a client can send 100 requests in the last second of minute 1 and 100 in the first second of minute 2. That's 200 requests in a 2-second window while the limit is 100/minute. The sliding window counter fixes this by weighting the previous window's count based on how much of it overlaps the current window. If you are 25% into the current minute, the effective count is 0.75 × previous_minute_count + current_minute_count. This gives smooth limiting behavior at the boundary with only two counters stored per key.

Implementing with Redis

A single-node Redis rate limiter using INCR and EXPIRE is straightforward: increment the counter for the (user_id, current_window) key, set the TTL on the first request of each window. But the INCR and EXPIRE are two separate commands, not atomic. A crash between them leaves a counter with no TTL that never expires. The fix: use Lua scripts (which Redis executes atomically) or the newer Redis SET with NX and PX flags to set value and TTL atomically. For token bucket semantics, Redis 7's built-in rate limiting module handles this natively.

The distributed rate limiting problem

With a fleet of 50 API servers, each maintaining its own in-process counter, the effective limit is 50N: each instance allows N requests independently. A client who routes all requests through one instance hits the local limit; one who distributes across all instances gets 50x the intended limit. The solution is a centralized counter in Redis, but this adds a network round-trip to every API request. The practical compromise is a local + global hybrid: each instance holds a local token bucket with a small capacity (burst absorption) and periodically syncs with a shared Redis counter. The local bucket handles microsecond traffic smoothing; the Redis counter enforces the global rate. Lyft's ratelimit service and Envoy's rate limiting filter use this architecture.

What happens when Redis is down

The most important rate limiter design question is: what should happen when the rate limit store is unavailable? Fail open (allow all requests) preserves availability: your service stays up but loses rate limiting. Fail closed (reject all requests) protects your backend but makes your service unavailable to legitimate users. The right answer is almost always fail open for external-facing APIs (a DoS attack on your Redis doesn't become a self-inflicted outage), and fail closed for internal rate limits that protect critical resources (payment processors, SMS APIs with per-message costs). Make the choice explicit in your design.

Explore the concepts

Consistent Hashing →Caching →

See it in action

API Rate Limiter →

← Back to all articles