Stop calling a failing dependency before it takes you down with it.
Plain English: when another service starts failing, stop calling it for a while. Otherwise your service piles up requests waiting for the broken one and crashes too. It works like the breaker in your home's electrical panel: it trips on a fault and can be reset once the fault clears.
A stateful proxy in front of a downstream call, with three states: closed (calls flow through normally), open (calls fail fast without hitting the dependency), and half-open (a few probe calls test whether the dependency has recovered). A failure threshold trips closed to open, a cooldown timer moves open to half-open, and the probe results decide whether the breaker closes or reopens.
When a downstream dependency is slow or down, naive callers keep retrying, exhausting their thread pools, eating their timeouts, and cascading the failure upward. A circuit breaker fails fast as soon as the dependency is sick, freeing your service to handle other work instead of getting stuck waiting on a broken downstream.
Track recent calls in a sliding window. If the failure rate exceeds a threshold (say 50% over the last 20 calls), trip the breaker: subsequent calls return an error immediately without touching the dependency. After a cooldown, move to half-open and allow a few probe calls. If they succeed, close the breaker; if any fail, reopen it and restart the cooldown.
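The mechanism described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not a production implementation: the class name, parameter names, and defaults (`window`, `failure_rate`, `cooldown`, `probes`) are assumptions chosen to mirror the numbers in the text, and a real service would add locking and metrics.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, window=20, failure_rate=0.5, cooldown=30.0,
                 probes=3, clock=time.monotonic):
        self.window = window              # number of recent calls tracked
        self.failure_rate = failure_rate  # trip when failures / window exceeds this
        self.cooldown = cooldown          # seconds to stay open before probing
        self.probes = probes              # consecutive probe successes needed to close
        self.clock = clock                # injectable so tests can control time
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cooldown elapsed: start letting probe calls through.
            self.state = "half_open"
            self.probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.probes:
                self.state = "closed"
                self.results.clear()
        else:
            self.results.append(True)

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()  # a failed probe reopens the breaker
            return
        self.results.append(False)
        window_full = len(self.results) == self.window
        failures = self.results.count(False)
        if window_full and failures / self.window > self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()  # restart the cooldown
        self.results.clear()
```

A caller would wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical client function) and treat `CircuitOpenError` as a signal to serve a fallback. One design choice worth noting: this sketch waits for a full window before evaluating the failure rate, so a single early failure cannot trip a freshly closed breaker.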
Implicit: Stripe calls in the Payment Service should be wrapped in a breaker to keep card-network slowness from cascading into checkout outages.
Per-provider breakers (SendGrid, Twilio, APNs) keep one provider's outage from blocking the others.
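One way to get per-provider isolation is a registry that holds a separate breaker for each provider name. The sketch below is an illustration under stated assumptions: `SimpleBreaker` is a deliberately reduced breaker that trips on consecutive failures rather than a sliding-window failure rate, and `send_via`, `send_fn`, and the provider keys are hypothetical names, not any provider's SDK.

```python
import time
from collections import defaultdict


class SimpleBreaker:
    """Deliberately minimal: trips after N consecutive failures (no sliding window)."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let calls through again as probes.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


# One breaker per provider name, created lazily on first use.
breakers = defaultdict(SimpleBreaker)


def send_via(provider, send_fn, message):
    """Try one provider; return True if sent, False if skipped or failed."""
    breaker = breakers[provider]
    if not breaker.allow():
        return False  # fail fast for this provider only
    try:
        send_fn(message)
    except Exception:
        breaker.record(ok=False)
        return False
    breaker.record(ok=True)
    return True
```

Because each provider has its own state, three failed email sends trip only the email breaker; SMS and push calls keep flowing through their own closed breakers.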
One typo in a routine S3 maintenance command took down half the internet for 4 hours.
While debugging slowness in the S3 billing system, an engineer ran a playbook command meant to remove a small number of servers from S3 in us-east-1. A typo in one input expanded the scope to a much larger set, including servers running the index subsystem and the placement subsystem. With the index gone, every read started failing, and the failure cascaded: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Restarting the index subsystem took 4+ hours because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.
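Scope validation of the kind this lesson calls for can be as simple as a guard that runs before any host is touched. The following is a hedged sketch, not a description of AWS tooling: `validate_removal`, `max_fraction`, and `min_remaining` are hypothetical names and thresholds picked for illustration.

```python
class ScopeError(Exception):
    """Raised when a removal request is larger than the guardrails allow."""


def validate_removal(requested, fleet, max_fraction=0.05, min_remaining=10):
    """Return the hosts to remove, or raise if the request looks too broad.

    All thresholds are hypothetical defaults for illustration.
    """
    fleet = set(fleet)
    requested = set(requested)

    # A typo often shows up as hosts that are not in the target fleet at all.
    unknown = requested - fleet
    if unknown:
        raise ScopeError(f"unknown hosts: {sorted(unknown)}")

    # Cap how much of the fleet a single command may remove.
    if len(requested) > max_fraction * len(fleet):
        raise ScopeError(
            f"refusing to remove {len(requested)} of {len(fleet)} hosts "
            f"(limit {max_fraction:.0%} per command)"
        )

    # Never let the fleet drop below its minimum safe capacity.
    if len(fleet) - len(requested) < min_remaining:
        raise ScopeError("removal would drop fleet below minimum capacity")

    return sorted(requested)
```

The point of the design is that an oversized request fails loudly before execution, so a mistyped argument costs an error message instead of a subsystem.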
A single bad regex took down ~all Cloudflare-fronted sites globally for 27 minutes.
Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. On any HTTP request with the right pattern, the regex would run for seconds at 100% CPU on every CPU core globally. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic. ~all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; staged rollout for any rule that runs on every request; the safety mechanism is only as good as your ability to actually deploy a rollback.
A stale feature flag on one server bankrupted a 17-year-old trading firm in 45 minutes.
Knight Capital rolled out new trading code across its 8 servers but missed one. That one server still ran old code that, combined with a re-used feature flag from a long-retired test feature, started buying high and selling low on every trade it received. In 45 minutes the firm lost $440M, more than the company's entire net assets. The lessons: deploy automation should fail closed when a host doesn't ack. Feature flags should be deleted when their feature is retired, not left as time bombs. Anything that touches real money needs invariant checks the code can't bypass ('we should never buy 200% above market').
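An invariant check of the "never buy far above market" kind could sit between the strategy code and the order gateway. The sketch below is illustrative only: `check_order`, `MAX_DEVIATION`, and the 10% band are assumptions, and a real system would enforce such limits in a separate component that trading code cannot skip.

```python
from decimal import Decimal

MAX_DEVIATION = Decimal("0.10")  # hypothetical: 10% band around the market price


class InvariantViolation(Exception):
    """Raised when an order breaks a hard pre-trade limit."""


def check_order(side, limit_price, market_price, max_deviation=MAX_DEVIATION):
    """Reject orders priced too far on the losing side of the market."""
    if side not in ("buy", "sell"):
        raise InvariantViolation(f"unknown side: {side!r}")
    if market_price <= 0:
        raise InvariantViolation("no valid market price")

    upper = market_price * (1 + max_deviation)
    lower = market_price * (1 - max_deviation)

    # Buying far above market or selling far below it is never intended.
    if side == "buy" and limit_price > upper:
        raise InvariantViolation(f"buy at {limit_price} exceeds cap {upper}")
    if side == "sell" and limit_price < lower:
        raise InvariantViolation(f"sell at {limit_price} is under floor {lower}")
```

For example, `check_order("buy", Decimal("300"), Decimal("100"))` raises `InvariantViolation`, while a buy at `Decimal("101")` passes silently.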
A customer config trigger crashed Fastly globally: 49 minutes, half the modern web dark.
Fastly had pushed a config update weeks earlier that introduced a latent bug, only triggered by a specific customer configuration pattern. When that customer eventually applied their config, the bug fired across Fastly's global edge fleet within 12 seconds. Reddit, the NYT, Amazon, the UK Gov website: all 503ing simultaneously. Recovery took 49 minutes because the rollback procedure itself depended on healthy edge nodes. The lesson: latent bugs triggered by customer input are essentially production bombs. Canary deployments must rotate, and your incident-response paths must work even when your data plane is on fire.