Circuit Breaker

Stop calling a failing dependency before it takes you down with it.

Plain English: when another service starts failing, stop calling it for a while. Otherwise your service piles up requests waiting on the broken one and crashes too. It works like the breaker in your home's electrical panel: it trips when something is wrong and can be reset once the fault clears.

Used in: Payment Gateway, Notification System

What it is

A stateful proxy in front of a downstream call. It has three states: closed (calls flow through normally), open (calls fail fast without hitting the dependency), and half-open (a few probe calls test whether the dependency has recovered). Failure thresholds drive the transition from closed to open; a cooldown timer and the outcome of the probe calls drive the rest.
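
The three states and the moves between them can be written down as a small lookup table. This is only a sketch: the event names (`threshold_exceeded`, `cooldown_elapsed`, `probes_succeeded`, `probe_failed`) are hypothetical labels chosen for illustration, not terms from any particular library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls flow through normally
    OPEN = "open"            # calls fail fast, dependency is not touched
    HALF_OPEN = "half-open"  # a few probe calls are let through


# (current state, event) -> next state.
# The event names are assumptions made for this sketch.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probes_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct path from open back to closed: the breaker has to pass through half-open and see probes succeed first.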

The problem it solves

When a downstream dependency is slow or down, naive callers keep calling and retrying. Each call holds a thread until its timeout expires, the thread pool drains, and the failure cascades upward. A circuit breaker fails fast once the dependency looks sick, freeing your service to handle other work instead of waiting on a broken downstream.

How it works

Track recent calls in a sliding window. If the failure rate exceeds a threshold (say, 50% over the last 20 calls), trip the breaker: subsequent calls return an error immediately without touching the dependency. After a cooldown, allow a few probe calls. If they succeed, close the breaker; if any fails, reopen it and restart the cooldown.
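
A minimal, single-threaded sketch of that loop follows. The class name, parameter names, and defaults (`window=20`, `failure_rate=0.5`, `cooldown=30.0`, `probes=3`) are assumptions made for illustration. It also assumes the failure rate is only evaluated once the window is full, and that any raised exception counts as a failure. A production breaker would additionally need locking and a cap on concurrent probes.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, window=20, failure_rate=0.5, cooldown=30.0,
                 probes=3, clock=time.monotonic):
        self.window = window              # number of recent calls to track
        self.failure_rate = failure_rate  # trip when the rate exceeds this
        self.cooldown = cooldown          # seconds to stay open before probing
        self.probes = probes              # successes needed to close again
        self.clock = clock                # injectable so tests can fake time
        self.state = "closed"
        self.results = deque(maxlen=window)  # True marks a failed call
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open, failing fast")
            # Cooldown elapsed: let probe calls through.
            self.state = "half-open"
            self.probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()  # a failed probe reopens the breaker
            return
        self.results.append(True)
        if (len(self.results) == self.window
                and sum(self.results) / self.window > self.failure_rate):
            self._trip()

    def _on_success(self):
        if self.state == "half-open":
            self.probe_successes += 1
            if self.probe_successes >= self.probes:
                self.state = "closed"
                self.results.clear()
            return
        self.results.append(False)

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()  # restart the cooldown
        self.results.clear()
```

A caller would wrap each downstream call, for example `breaker.call(charge_card, order)` (both names hypothetical), and treat `CircuitOpenError` as the cue to return a fallback or a fast error instead of waiting on a timeout.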

When this went wrong in production

AWS S3 us-east-1 melts the internet · 2017

One typo in a routine S3 maintenance command took S3 in us-east-1, and much of the web that depends on it, offline for about 4 hours.

While debugging the S3 billing system, an engineer ran a playbook command meant to remove a small number of servers from S3 in us-east-1. A mistyped input expanded the scope to a much larger set, including servers running the index subsystem and the placement subsystem. With the index gone, S3 requests in the region started failing. The failure cascaded: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Recovery took about 4 hours because the index subsystem hadn't been fully restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, and your critical recovery paths need to be exercised regularly so they don't atrophy.

Cloudflare regex CPU-bomb · 2019

A single bad regex took down nearly all Cloudflare-fronted sites globally for 27 minutes.

Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. Evaluating it against incoming HTTP requests drove the CPU cores serving traffic to 100% worldwide. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic, and nearly all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; use a staged rollout for any rule that runs on every request; the safety mechanism is only as good as your ability to actually deploy a rollback.

Knight Capital: $440M in 45 minutes · 2012

A stale feature flag on one server nearly bankrupted a 17-year-old trading firm in 45 minutes.

Knight Capital deployed new trading code to seven of its eight servers and missed the eighth. That server still ran old code that, combined with a reused feature flag from a long-retired test feature, started buying high and selling low on every order it received. In 45 minutes the firm lost $440M, more than the company's entire net assets, and it survived only through an emergency rescue. The lessons: deploy automation should fail closed when a host doesn't acknowledge the rollout. Feature flags should be deleted when their feature is retired, not left as time bombs. Anything that touches real money needs invariant checks the code can't bypass ('we should never buy 200% above market').

Fastly takes down the internet · 2021

A customer config trigger crashed Fastly globally: 49 minutes, half the modern web dark.

Fastly had shipped a software deployment weeks earlier that introduced a latent bug, triggered only by a specific customer configuration pattern. When a customer eventually applied such a config, the bug fired across Fastly's global edge fleet within 12 seconds. Reddit, the NYT, Amazon, and the UK government website were all 503ing simultaneously. Recovery took 49 minutes because the rollback procedure itself depended on healthy edge nodes. The lesson: latent bugs triggered by customer input are essentially production bombs. Canary deployments must rotate, and your incident-response paths must work even when your data plane is on fire.
