Circuit Breaker

Stop calling a failing dependency before it takes you down with it.

When a dependency starts failing, the breaker stops calling it after a few failures (OPEN), so your service fails fast instead of piling up requests on a dead one. After a cooldown it tests the waters (HALF-OPEN) before fully trusting the dependency again.

Plain English: when another service starts failing, stop calling it for a while. Otherwise your service piles up requests waiting for the broken one and crashes too. It works like the breaker panel in your house, except this one resets itself after a cooldown.

Used in: Payment Gateway, Notification System

What it is

A stateful proxy in front of a downstream call. Three states: closed (calls flow through normally), open (calls fail fast without hitting the dependency), half-open (a few probe calls test whether the dependency has recovered). Transitions between states are driven by failure thresholds, a cooldown timer, and the outcome of the probe calls.

The problem it solves

When a downstream dependency is slow or down, naive callers keep retrying, exhausting their thread pools, eating their timeouts, and cascading the failure upward. A circuit breaker fails fast as soon as the dependency is sick, freeing your service to handle other work instead of getting stuck waiting on a broken downstream.

How it works

Track recent calls in a sliding window. If the failure rate exceeds a threshold (say 50% over the last 20 calls), trip the breaker: subsequent calls return an error immediately without touching the dependency. After a cooldown, move to half-open and allow a few probe calls; if they succeed, close the breaker. If a probe fails, reopen the breaker and restart the cooldown.
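
The state machine described above can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not a production implementation: the class name, parameter names, and defaults (a 20-call window, a 50% threshold, a 30-second cooldown, 3 probes) are assumptions chosen to match the example numbers in the text, and it treats any exception from the wrapped call as a failure.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, window_size=20, failure_rate=0.5, cooldown_s=30.0,
                 probe_count=3, clock=time.monotonic):
        self.window = deque(maxlen=window_size)  # True = failure, False = success
        self.failure_rate = failure_rate
        self.cooldown_s = cooldown_s
        self.probe_count = probe_count
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast; dependency not called")
            # Cooldown elapsed: let probe calls through.
            self.state = State.HALF_OPEN
            self.probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probe_count:
                self.window.clear()  # start fresh once the dependency is trusted
                self.state = State.CLOSED
        else:
            self.window.append(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe reopens the breaker
            return
        self.window.append(True)
        window_full = len(self.window) == self.window.maxlen
        if window_full and sum(self.window) / len(self.window) > self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()  # (re)starts the cooldown
```

A caller would wrap each downstream call as `breaker.call(fetch_profile, user_id)` (a hypothetical function) and treat `CircuitOpenError` as the signal to return a fallback. The sketch waits for a full window before it can trip, takes no locks, and does not cap concurrent probes in half-open, so a real service would need to add those.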

Why use it

Prevents cascading failures across services
Fails fast, freeing up threads/connections to do useful work
Probes detect recovery without slamming the recovering service

What it costs you

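Breaker state needs a scope decision: per-instance state is simple but every instance has to discover the failure on its own, while cluster-wide state requires shared coordination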
Thresholds need tuning per dependency (a flaky service might naturally fail 10% of the time)
Adds complexity; not worth it for in-process calls

Where it shows up in our architectures

Payment Gateway →
Implicit: Stripe calls in the Payment Service should be wrapped in a breaker to keep card-network slowness from cascading into checkout outages
Notification System →
Per-provider breakers (SendGrid, Twilio, APNs) keep one provider's outage from blocking the others
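
As an illustration of the per-provider idea, the sketch below keeps one breaker per provider name so that a failing provider is skipped while the others keep sending. Everything here is hypothetical: `SimpleBreaker` is a deliberately reduced consecutive-failure counter (not the sliding-window variant), `Notifier` and its return values are made up for the example, and the sender callables stand in for real SendGrid, Twilio, or APNs clients.

```python
import time


class SimpleBreaker:
    """Minimal consecutive-failure breaker (a simplification, not a sliding window)."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let calls through again as probes.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or reopen after a failed probe


class Notifier:
    def __init__(self, senders, **breaker_kwargs):
        # senders: provider name -> callable(message); one breaker per provider.
        self.senders = senders
        self.breakers = {name: SimpleBreaker(**breaker_kwargs) for name in senders}

    def send(self, provider, message):
        breaker = self.breakers[provider]
        if not breaker.allow():
            return "skipped"  # fail fast; other providers are unaffected
        try:
            self.senders[provider](message)
        except Exception:
            breaker.record(ok=False)
            return "failed"
        breaker.record(ok=True)
        return "sent"
```

Because each provider has its own breaker object, a run of failures from one sender only changes that provider's state; calls to the others never consult it.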

Gotchas

A breaker doesn't help if every request is independent and short-lived. Its value comes from preventing thread/connection exhaustion.
Coordinating breaker state across instances is hard. Per-instance breakers are simpler and usually enough.
Don't put a breaker around a call that can't fail (e.g. in-process function calls). It's only useful for calls with real failure modes.
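Pair with retries carefully: while the breaker is still closed, a tight retry loop multiplies the load on a struggling dependency, which is the failure pattern you were trying to avoid. Cap the attempts, back off between them, and don't retry a call the breaker has already rejected.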
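
One way to pair retries with a breaker, sketched under assumptions: `guarded_call` is a hypothetical zero-argument function that already goes through a breaker, and `CircuitOpenError` is a hypothetical exception that breaker raises when it rejects a call. The helper caps attempts, backs off with jitter, and stops immediately when the breaker is open.

```python
import random
import time


class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


def call_with_retries(guarded_call, max_attempts=3, base_delay_s=0.1,
                      sleep=time.sleep):
    """Retry a breaker-guarded call with capped attempts and jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return guarded_call()
        except CircuitOpenError:
            # The breaker has given up on the dependency: retrying only adds load.
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter so retries from many
            # callers do not arrive in lockstep.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
```

Routing every attempt through the breaker means failed retries still count toward its failure rate, so a dependency that keeps failing trips the breaker sooner rather than being hidden behind the retry loop.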

When this went wrong in production

AWS S3 us-east-1 melts the internet · 2017

One typo in a routine S3 maintenance command took down half the internet for 4 hours.

An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. S3 lost the index → every read started failing. Cascading failure: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.

Cloudflare regex CPU-bomb · 2019

A single bad regex took down ~all Cloudflare-fronted sites globally for 27 minutes.

Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. On any HTTP request with the right pattern, the regex would run for seconds at 100% CPU on every CPU core globally. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic. ~all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; staged rollout for any rule that runs on every request; the safety mechanism is only as good as your ability to actually deploy a rollback.

Knight Capital: $440M in 45 minutes · 2012

A stale feature flag on one server all but bankrupted a 17-year-old trading firm in 45 minutes.

Knight Capital deployed new trading code to 8 servers, but missed one. That one server still had old code that, combined with a re-used feature flag from a long-retired test feature, started buying high and selling low on every trade it received. In 45 minutes the firm lost $440M, more than the company's entire net assets. The lesson: deploy automation that fails closed when a host doesn't ack. Feature flags should be deleted when their feature is retired, not left as time bombs. Anything that touches real money needs invariant checks the code can't bypass ('we should never buy 200% above market').

Fastly takes down the internet · 2021

A customer config trigger crashed Fastly globally: 49 minutes, half the modern web dark.

Fastly had pushed a config update weeks earlier that introduced a latent bug, only triggered by a specific customer configuration pattern. When that customer eventually applied their config, the bug fired across Fastly's global edge fleet within 12 seconds. Reddit, the NYT, Amazon, the UK Gov website: all 503ing simultaneously. Recovery took 49 minutes because the rollback procedure itself depended on healthy edge nodes. The lesson: latent bugs triggered by customer input are essentially production bombs. Canary deployments must rotate, and your incident-response paths must work even when your data plane is on fire.

Discord's message queue backs up and drops 1M+ events · 2023

A Cassandra compaction storm caused read latency to spike, backing up the message fanout queue until it overflowed.

Discord's message fanout pipeline copies messages to every online member's session via a Kafka-backed queue consumed by workers reading from Cassandra. During a Cassandra compaction event, read latency on the compacting node spiked from single-digit milliseconds to hundreds of milliseconds. Workers waiting on Cassandra acks started piling up. The Kafka consumer group fell behind. Lag grew faster than workers could drain it. Discord's queue had a max-lag threshold: once crossed, older events were dropped to keep the pipeline from stalling permanently. Over 1 million message-delivery events were dropped. Users in large servers saw their friends' messages but not the server's activity feed. The lesson: consumer lag needs a circuit breaker, not a silent overflow. Treat Cassandra compaction like a planned partial degradation, not a background task.

Slack's 5-hour outage from a cascading cache failure · 2022

A cache misconfiguration caused a load spike that overwhelmed Slack's databases in sequence.

Slack deployed a Memcached configuration change that accidentally reduced the effective cache size. Requests that would have hit cache started hitting the database. The database absorbed the initial surge but latency crept up. Slower DB responses caused app servers to hold connections longer, exhausting their connection pools. Exhausted pools caused requests to queue. Queued requests timed out and clients retried, amplifying the load. The database load balancer fell over. Slack was effectively down for 5 hours for most users. The lesson: cache and database tiers aren't independent. A cache miss rate increase of just 5-10% can mean 10x database load on a busy system. Monitor cache hit rate as a first-class operational metric and have a circuit breaker for cache degradation.

Amazon Prime Day collapses under its own launch load · 2018

Prime Day 2018 opened with Amazon's own landing page returning errors for the first 90 minutes.

Prime Day 2018 launched with a load spike Amazon had anticipated and prepared for, but not quite enough. The front-end tier scaled horizontally via auto-scaling groups. The recommendation service underneath did not: it depended on a Redis cluster sized for projected peak, not actual peak. The Redis cluster hit its connection limit within minutes of launch. Backend services queuing for Redis connections started timing out. The front-end returned errors. The recommendation service's circuit breaker was supposed to fail open (show a degraded UI without personalization), but configuration drift meant it was set to fail closed instead. Customers saw error dogs on Amazon.com for 90 minutes. The lesson: auto-scaling the frontend while leaving stateful dependencies unscaled is the most common Prime-Day-class mistake. Circuit breakers also need to be exercised in production, not just configured and forgotten.

Google Cloud networking failure: 4 hours, 3 regions · 2019

A config push to the backbone control plane caused packet loss across three GCP regions for four hours.

Google pushed a config update to the network control plane managing inter-region backbone routing. The config included software that consumed far more memory than expected under production conditions, causing the control plane to crash on a large fraction of routers. Each restarting router needed to re-establish BGP peering, which consumed network capacity. Restarting routers and network traffic competing for bandwidth created a feedback loop: routers trying to recover caused more congestion, which slowed recovery further. Three GCP regions (us-east1, us-central1, europe-west1) experienced 30-87% packet loss for services using the Google backbone. The lesson: stage control plane changes and validate memory/resource usage before the push. A control plane change should never be able to create a data plane feedback loop.

Twitter's self-inflicted API shutdown · 2023

Twitter removed free API access with 48-hour notice, breaking thousands of apps and bots instantly.

In February 2023, Twitter/X announced it would end free API access with roughly 48 hours notice, requiring all developers to move to paid tiers. This wasn't an outage in the traditional sense, but the outcome was the same: thousands of Twitter-integrated apps, bots, academic tools, and emergency-alert services stopped working simultaneously. Wildfire alert bots, public transit notification bots, journalism tools: all went dark. The lesson is about API contract stability, not fault tolerance. If you build on a third-party API, treat their rate limits and pricing as a failure mode, not a constant. Design your system so that a third-party API becoming unavailable or prohibitively expensive doesn't cascade into a user-facing outage.

DoorDash Redis cluster overload cascades to full outage · 2021

A single Redis cluster used for rate limiting became a cascading single point of failure during peak dinner hours.

DoorDash used a central Redis cluster to store rate limiting counters. During a high-traffic event, the Redis cluster started showing elevated latency. Services calling the rate limiter were waiting on Redis responses and holding threads. Thread pools exhausted. Services started returning 503s. Upstream services receiving errors started retrying, amplifying the load. The failure cascaded horizontally: order placement, merchant dashboards, and driver assignment all went down because they all shared the same rate-limiter Redis cluster. This dependency didn't appear in any single team's architecture diagram. The lesson: shared infrastructure like rate limiters must be treated as SLO-critical with blast-radius isolation. If rate limiting fails, it should fail open, not block the entire request path.
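
The "fail open" lesson can be sketched as a thin wrapper around the limiter check. This is a hypothetical illustration, not DoorDash's code: `FailOpenLimiter` and `check_limit` are invented names, and the sketch assumes the underlying check raises an exception on errors and timeouts instead of blocking indefinitely.

```python
class FailOpenLimiter:
    """Wraps a rate-limit check so limiter outages do not block requests."""

    def __init__(self, check_limit):
        # check_limit: callable(key) -> bool, e.g. a lookup against a shared
        # counter store. It is expected to raise on errors and timeouts.
        self.check_limit = check_limit
        self.bypassed = 0  # how many requests were allowed without a real check

    def allow(self, key):
        try:
            return bool(self.check_limit(key))
        except Exception:
            # Fail open: losing rate limiting for a while is cheaper than
            # rejecting every request that passes through the limiter.
            self.bypassed += 1
            return True
```

Counting bypassed requests keeps the degradation visible, so an alert can fire while traffic continues to flow.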

Netflix's Christmas Eve outage that built Chaos Engineering · 2012

An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve and pushed the company to build Chaos Kong.

On Christmas Eve 2012, Amazon suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident directly caused Netflix to accelerate their multi-region active-active migration and to build Chaos Kong, the tool that kills an entire AWS region in production on a regular schedule, to ensure the failover path never goes stale. The lesson: a failover plan that's never been tested is a guess. You only trust what you've actually run.

Interview angle

Circuit breakers come up when an interviewer asks what happens when a downstream service is slow. The key thing to say is that timeouts alone are not enough because slow calls still consume threads, and enough of them starve your entire service. Candidates lose points by stopping at 'add a timeout' without explaining how you detect systemic failure and stop sending traffic.
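
A quick back-of-the-envelope calculation makes the thread-starvation argument concrete. It uses Little's law (average concurrency equals arrival rate times time per call); the traffic rate, latencies, and pool size below are illustrative assumptions, not figures from any real system.

```python
def threads_in_use(requests_per_s, ms_per_call):
    """Little's law: average concurrency = arrival rate x time per call."""
    return requests_per_s * ms_per_call / 1000


POOL_SIZE = 50  # assumed worker threads per instance
RATE = 100      # assumed calls per second to the dependency

healthy = threads_in_use(RATE, 50)       # dependency answers in 50 ms
timing_out = threads_in_use(RATE, 2000)  # every call waits out a 2 s timeout
failing_fast = threads_in_use(RATE, 1)   # open breaker rejects in about 1 ms

for label, needed in [("healthy", healthy), ("timing out", timing_out),
                      ("failing fast", failing_fast)]:
    status = "ok" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: {needed:g} threads needed of {POOL_SIZE} ({status})")
```

Under these assumptions the timeout bounds each call but still demands four times the pool, while an open breaker brings the demand below a single thread. That gap is the point to make in the interview.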
