Load Balancing

Spread incoming requests across a pool of identical servers so no single one melts.

With round-robin, the balancer hands each request to the next server in line, so no single server is overwhelmed. If a server dies, traffic simply flows to the survivors, which is how a pool of cheap servers stays available even when machines die.
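A minimal sketch of that behavior, assuming an in-memory list of server names and a manual kill() call standing in for real failure detection. The class and method names are hypothetical, not taken from any particular balancer.

```python
from itertools import count


class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.alive = set(self.servers)
        self._ticket = count()  # ever-increasing position in the rotation

    def kill(self, server):
        """Mark a server as down so it is skipped from now on."""
        self.alive.discard(server)

    def pick(self):
        """Return the next live server in rotation."""
        if not self.alive:
            raise RuntimeError("no healthy servers left")
        # Walk the rotation until a live server turns up. Because at least
        # one server is alive, this ends within len(self.servers) steps.
        while True:
            candidate = self.servers[next(self._ticket) % len(self.servers)]
            if candidate in self.alive:
                return candidate


lb = RoundRobinBalancer(["a", "b", "c"])
before = [lb.pick() for _ in range(3)]  # every server takes a turn
lb.kill("b")
after = [lb.pick() for _ in range(4)]   # "b" no longer appears
```

Skipping dead servers inside pick() keeps the rotation order stable for the survivors, at the cost of a few wasted loop iterations per request while a server is down.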

Plain English: instead of sending every request to one server, put a traffic cop in front of N identical servers and have it spread the load. The traffic cop is the load balancer.

What it is

A component that sits in front of a pool of backend servers and distributes incoming requests across them. The pool looks like one big server to clients; behind the scenes, the load balancer picks which actual instance handles each request.

The problem it solves

Any service with more traffic than a single box can handle needs to spread that traffic somehow. Without a load balancer, you'd hard-code multiple endpoints into clients (terrible for ops) or rely on DNS round-robin (poor failure handling, no real-time adjustment).

How it works

Common strategies: round-robin (each request goes to the next server in line), least-connections (each new request goes to the box with the fewest open connections), weighted (bigger boxes get a larger share), and consistent hashing (the same key always maps to the same box, which gives sticky routing for sessions or caches). The LB also health-checks each backend and routes around failures.
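Two of these strategies fit in a few lines each. The sketch below assumes the balancer already tracks open connection counts and configured weights in plain dictionaries; the function names are illustrative only.

```python
import random


def pick_least_connections(open_connections):
    """Return the backend with the fewest open connections.

    open_connections maps backend name -> current open connection count.
    Ties go to whichever backend appears first in the mapping.
    """
    return min(open_connections, key=open_connections.get)


def pick_weighted(weights, rng=random):
    """Return a backend chosen with probability proportional to its weight.

    weights maps backend name -> relative capacity (bigger box, bigger number).
    """
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Example: "b" is the least busy box, so it gets the next request.
target = pick_least_connections({"a": 5, "b": 2, "c": 9})
```

Least-connections tends to suit long-lived requests, because it reacts to how busy each box actually is rather than to how many requests it has been handed.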

Why use it

Horizontal scaling: add boxes, traffic spreads automatically
Failure tolerance: sick boxes get pulled from rotation
Single endpoint for clients; backends are abstracted away

What it costs you

The LB itself is a single point of failure (SPOF) unless run as a pair or fleet
Sticky sessions complicate scaling and are best avoided when possible
Layer 4 (TCP) vs Layer 7 (HTTP) is a real tradeoff in flexibility vs performance
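On the sticky-session cost: when some stickiness is unavoidable, consistent hashing limits the damage, because removing a box only remaps the keys that lived on it. A minimal sketch, assuming SHA-256 as the hash function and 100 virtual nodes per server; both are arbitrary choices made here, and the HashRing class is hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    # SHA-256 is used only for its even spread, not for security.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Maps keys to servers so that membership changes move few keys."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point_on_ring, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def lookup(self, key):
        """Return the server owning the first ring point at or after the key."""
        if not self._ring:
            raise RuntimeError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]  # wrap past the end


ring = HashRing(["a", "b", "c"])
owner = ring.lookup("session-42")  # the same key always lands on the same box
```

With plain modulo hashing (hash(key) % N), changing N reshuffles almost every key, which is why the ring is worth the extra code for caches and sessions.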

Where it shows up in our architectures

API Rate Limiter: the edge proxy is essentially a load balancer with a rate-limit hook
Uber: implicit in front of every service tier, notably where Dispatch is horizontally scaled

Gotchas

Stateless backends are dramatically easier to load-balance. If your backend has per-instance state, you've imported all the problems sticky sessions cause.
Layer 4 LB is faster but can't make routing decisions based on URL/headers. Layer 7 (Envoy, Nginx) is slower but gives you path-based routing, header rewrites, etc.
Health checks should hit a real endpoint that exercises the dependencies. A '/health' that always returns 200 will happily route traffic to a box that can't reach the database.
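A minimal sketch of the health-check point, assuming each dependency is probed by a zero-argument callable that raises on failure. The dependency names and probes below are hypothetical stand-ins; a real probe would run something cheap such as a trivial query against the database.

```python
def health_status(checks):
    """Run each dependency probe and return (http_status, details).

    checks maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unreachable.
    """
    details = {}
    for name, probe in checks.items():
        try:
            probe()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"failed: {exc}"
    healthy = all(state == "ok" for state in details.values())
    # 503 tells the load balancer to pull this box from rotation.
    return (200 if healthy else 503), details


def database_unreachable():
    # Stand-in for a real probe that failed to connect.
    raise ConnectionError("connection refused")


status, details = health_status({
    "cache": lambda: None,             # pretend the cache answered
    "database": database_unreachable,  # pretend the database did not
})
```

One design caution with this style of check: if every box shares the same dependency, a dependency outage marks the whole pool unhealthy at once, so it can be worth deciding in advance what the balancer should do when no backend passes.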

When this went wrong in production

Cloudflare regex CPU-bomb · 2019

A single bad regex took down nearly all Cloudflare-fronted sites globally for 27 minutes.

Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. On any HTTP request with the right pattern, the regex would run for seconds at 100% CPU, and because the rule shipped everywhere at once, this happened on every CPU core globally. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic, and nearly all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; use a staged rollout for any rule that runs on every request; and remember that a safety mechanism is only as good as your ability to actually deploy a rollback.

Slack's 5-hour outage from a cascading cache failure · 2022

A cache misconfiguration caused a load spike that overwhelmed Slack's databases in sequence.

Slack deployed a Memcached configuration change that accidentally reduced the effective cache size. Requests that would have hit cache started hitting the database. The database absorbed the initial surge but latency crept up. Slower DB responses caused app servers to hold connections longer, exhausting their connection pools. Exhausted pools caused requests to queue. Queued requests timed out and clients retried, amplifying the load. The database load balancer fell over. Slack was effectively down for 5 hours for most users. The lesson: cache and database tiers aren't independent. A drop in cache hit rate of just 5-10 percentage points can mean 10x database load on a busy system: going from a 99% hit rate to 90% takes misses from 1% of requests to 10%. Monitor cache hit rate as a first-class operational metric and have a circuit breaker for cache degradation.
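The arithmetic behind the cache lesson above can be sketched as follows. The hit rates are illustrative assumptions, not Slack's actual numbers, and the model assumes every cache miss becomes exactly one database query with no retries.

```python
def db_load_multiplier(old_hit_rate, new_hit_rate):
    """How many times more requests reach the database after a hit-rate change.

    Assumes each cache miss turns into exactly one database query and that
    total request volume stays constant (retries would make it worse).
    """
    if not (0 <= old_hit_rate < 1 and 0 <= new_hit_rate <= 1):
        raise ValueError(
            "hit rates must be between 0 and 1, and old_hit_rate must be below 1"
        )
    return (1 - new_hit_rate) / (1 - old_hit_rate)


# Illustrative numbers: a 99% hit rate slipping to 90%.
multiplier = db_load_multiplier(0.99, 0.90)
print(f"database load grows roughly {multiplier:.0f}x")
```

The higher the starting hit rate, the more fragile the database tier is: the same nine-point drop from 50% to 41% raises database load by less than a fifth.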

Google Cloud networking failure: 4 hours, 3 regions · 2019

A config push to the backbone control plane caused packet loss across three GCP regions for four hours.

Google pushed a config update to the network control plane managing inter-region backbone routing. The config included software that consumed far more memory than expected under production conditions, causing the control plane to crash on a large fraction of routers. Each restarting router needed to re-establish BGP peering, which consumed network capacity. Restarting routers and network traffic competing for bandwidth created a feedback loop: routers trying to recover caused more congestion, which slowed recovery further. Three GCP regions (us-east1, us-central1, europe-west1) experienced 30-87% packet loss for services using the Google backbone. The lesson: stage control plane changes and validate memory/resource usage before the push. A control plane change should never be able to create a data plane feedback loop.

Interview angle

Load balancing comes up in almost every interview as a given, but the signal interviewers want is whether you can go beyond 'just add a load balancer.' Specifically, they want you to name the algorithm (least-connections for long-lived requests, round-robin for stateless short calls) and flag that sticky sessions are a trap. Candidates lose points by treating a load balancer as free horizontal scale without addressing session state or health check depth.
