Real outages, one paragraph each. The lesson is the point. 7 entries.
A customer config trigger crashed Fastly globally: 49 minutes, half the modern web dark.
Fastly had pushed a config update weeks earlier that introduced a latent bug, only triggered by a specific customer configuration pattern. When that customer eventually applied their config, the bug fired across Fastly's global edge fleet within 12 seconds. Reddit, the NYT, Amazon, the UK Gov website: all 503ing simultaneously. Recovery took 49 minutes because the rollback procedure itself depended on healthy edge nodes. The lesson: latent bugs triggered by customer input are essentially production bombs. Canary deployments must rotate, and your incident-response paths must work even when your data plane is on fire.
A BGP misconfig wiped Facebook from the internet for 6 hours, including their badge access.
A routine command intended to assess global backbone capacity was issued, but a bug in Facebook's audit tool failed to stop it. The command withdrew all Facebook BGP routes, taking the company off the internet. Worse: the same DNS infrastructure that announced their existence to the world also gated their internal tools, including the badge-access system at the datacenters. Engineers couldn't VPN in, couldn't open the doors, couldn't even reach the management plane to roll back. Recovery required physically driving engineers to the datacenter floor. The lesson: never let your control plane depend on your data plane. Out-of-band access has to actually be out-of-band.
A single bad regex took down ~all Cloudflare-fronted sites globally for 27 minutes.
Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. On any HTTP request with the right pattern, the regex would run for seconds at 100% CPU on every CPU core globally. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic. ~all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; staged rollout for any rule that runs on every request; the safety mechanism is only as good as your ability to actually deploy a rollback.
A 43-second network partition triggered 24 hours of data inconsistency.
A 43-second network partition between GitHub's US-East and US-West data centers caused MySQL clusters in both regions to elect themselves primary (split-brain). When the partition healed, both regions had accepted writes and now had divergent state. GitHub chose consistency over availability: they took the service degraded for 24+ hours while they manually reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. When the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.
One typo in a routine S3 maintenance command took down half the internet for 4 hours.
An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. S3 lost the index → every read started failing. Cascading failure: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. Took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, AND your critical recovery paths need to be exercised regularly so they don't atrophy.
An engineer ran rm -rf on the wrong database during a 1am incident response.
At 1am, GitLab's secondary database was lagging badly. An engineer trying to fix replication accidentally ran `rm -rf` on the PRIMARY database directory instead of the secondary. Production data, gone. They had FIVE backup mechanisms in place: snapshots, dumps, replicas, etc. Four of them were silently broken or empty. The fifth had a 6-hour-old backup. They lost 6 hours of project data, 5,000 projects, 5,000 comments, 700 new users. The lesson: backups that aren't tested aren't backups. Restore drills are not optional. They're the only thing that proves your backup strategy works.
A stale feature flag on one server bankrupted a 17-year-old trading firm in 45 minutes.
Knight Capital deployed new trading code to 8 servers, but missed one. That one server still had old code that, combined with a re-used feature flag from a long-retired test feature, started buying high and selling low on every trade it received. In 45 minutes the firm lost $440M, more than the company's entire net assets. The lesson: deploy automation that fails closed when a host doesn't ack. Feature flags should be deleted when their feature is retired, not left as time bombs. Anything that touches real money needs invariant checks the code can't bypass ('we should never buy 200% above market').