War stories

Real outages, one paragraph each. The lesson is the point. 20 entries.

Discord's message queue backs up and drops 1M+ events · 2023
Postmortem ↗
A Cassandra compaction storm caused read latency to spike, backing up the message fanout queue until it overflowed.
Discord's message fanout pipeline copies messages to every online member's session via a Kafka-backed queue consumed by workers reading from Cassandra. During a Cassandra compaction event, read latency on the affected node spiked from single-digit milliseconds to hundreds. Workers waiting on Cassandra responses started piling up. The Kafka consumer group fell behind, and lag grew faster than workers could drain it. Discord's queue had a max-lag threshold: once crossed, older events were dropped to keep the pipeline from stalling permanently. Over 1 million message-delivery events were dropped. Users in large servers saw their friends' messages but not the server's activity feed. The lesson: consumer lag needs a circuit breaker, not a silent overflow. Treat Cassandra compaction as a planned partial degradation, not a background task.
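One way to make overflow loud rather than silent is a breaker on consumer lag that trips well below the drop threshold, so shedding becomes an explicit, alertable state. A minimal sketch in Python; `LagBreaker`, both thresholds, and the idea of feeding it periodic lag samples are illustrative assumptions, not Discord's implementation.

```python
from dataclasses import dataclass


@dataclass
class LagBreaker:
    """Opens when lag reaches `trip_at`; closes only once lag falls to `reset_at`.

    Both thresholds are hypothetical and would sit below the queue's hard drop limit.
    """
    trip_at: int
    reset_at: int
    is_open: bool = False

    def observe(self, lag: int) -> bool:
        """Record the latest lag sample and return True while the breaker is open."""
        if not self.is_open and lag >= self.trip_at:
            self.is_open = True    # start explicit shedding and raise an alert
        elif self.is_open and lag <= self.reset_at:
            self.is_open = False   # backlog drained; resume normal fanout
        return self.is_open


breaker = LagBreaker(trip_at=100_000, reset_at=20_000)
states = [breaker.observe(lag) for lag in (5_000, 120_000, 60_000, 15_000)]
print(states)  # [False, True, True, False]
```

The gap between the two thresholds is deliberate: it keeps the breaker from flapping while the backlog drains.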
Message Queues Circuit Breaker Consistent Hashing
Cloudflare Workers KV stale reads for 35 minutes · 2023
Postmortem ↗
A replication topology change made Workers KV return data that was hours old globally.
Cloudflare Workers KV is a globally distributed key-value store built on eventual consistency: writes propagate to all edges within roughly 60 seconds. During a maintenance operation, an engineer changed the replication topology, specifically which nodes a region's reads fall back to on cache miss. The change accidentally routed reads for a subset of keys to a secondary tier that had stopped receiving updates. Edge nodes across all regions started serving stale values that were hours old, not seconds old. Feature flags, A/B test configs, and auth tokens stored in KV returned wrong results for 35 minutes. The lesson: eventually consistent systems have a defined propagation bound. Any change to the replication topology must be validated against that bound. Breaking propagation doesn't produce errors; it produces silent staleness that can persist indefinitely.
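Silent staleness can be turned into a measurable signal with a canary key: a writer stores the current timestamp on a schedule, and each edge checks how old the value it reads is. A rough sketch, assuming a 60-second propagation bound as described above; the function name and the 10-second write interval are hypothetical, and this is not a description of Cloudflare's tooling.

```python
import time

PROPAGATION_BOUND_S = 60.0  # from the text: writes reach all edges in roughly 60 seconds
WRITE_INTERVAL_S = 10.0     # hypothetical: how often the canary key is rewritten


def canary_is_stale(canary_written_at: float, now: float) -> bool:
    """True if the canary value seen at an edge is older than replication should allow."""
    max_age = PROPAGATION_BOUND_S + WRITE_INTERVAL_S
    return (now - canary_written_at) > max_age


now = time.time()
fresh = canary_is_stale(now - 25.0, now)        # within the bound
stale = canary_is_stale(now - 3 * 3600.0, now)  # hours old, as in the incident
```

Running the same check before and after a topology change gives a direct test of whether the propagation bound still holds.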
Replication CDN (Content Delivery Network) Caching
Twitter's self-inflicted API shutdown · 2023
Twitter removed free API access with 48-hour notice, breaking thousands of apps and bots instantly.
In February 2023, Twitter/X announced it would end free API access with roughly 48 hours' notice, requiring all developers to move to paid tiers. This wasn't an outage in the traditional sense, but the outcome was the same: thousands of Twitter-integrated apps, bots, academic tools, and emergency-alert services stopped working simultaneously. Wildfire alert bots, public transit notification bots, journalism tools: all went dark. The lesson is about API contract stability, not fault tolerance. If you build on a third-party API, treat their rate limits and pricing as a failure mode, not a constant. Design your system so that a third-party API becoming unavailable or prohibitively expensive doesn't cascade into a user-facing outage.
API Gateway Circuit Breaker
Azure Active Directory outage: MFA breaks for 14 hours · 2023
Postmortem ↗
A corrupted database update took down Azure AD MFA globally, locking millions of users out of Microsoft services.
In September 2023, a routine Azure Active Directory update introduced a corrupted data entry into the authentication service's configuration store. That store is read on every MFA request, so within minutes of deployment, MFA was failing globally. Services that depend on Azure AD for login, including Microsoft 365, Teams, Azure Portal, and Xbox, all started rejecting multi-factor auth. Because MFA was broken, engineers trying to reach the management plane to roll back had to use break-glass procedures. The update was eventually rolled back, but re-validation and cache clearing across global infrastructure took 14 hours. The lesson: a configuration store that is read on every request is on the critical path. Changes to it must be validated on live traffic via canary before global rollout. Break-glass procedures must not depend on the service they're trying to fix.
Authentication vs Authorization Replication Caching
Google Docs deletes documents for 0.001% of users · 2023
Postmortem ↗
A storage migration bug silently deleted the document content for a small fraction of Google Docs users.
During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent: Google Drive kept showing the document title and metadata, but opening the document showed blank content. Users didn't realize the content was gone right away, which delayed support tickets. Recovery was partial. Google's cross-region replication had the content, but the deletion had already propagated before the replication lag resolved, making some recent edits unrecoverable. The lesson: data migrations must be non-destructive. Write, then verify, then delete, with the delete behind a flag that can be rolled back. Replication protects against node failure, not against application-layer bugs, which it faithfully copies to every replica.
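The write, verify, delete sequence can be sketched as follows. This is an illustration under stated assumptions, not Google's migration code: the two stores are plain dicts of bytes, `migrate` is a hypothetical helper, and the reversible "delete" is modeled as a tombstone set that a later, separate job would act on.

```python
import hashlib


def migrate(doc_id: str, old: dict, new: dict, tombstones: set) -> bool:
    """Copy one document, verify the copy, then only mark the original for deletion."""
    content = old[doc_id]
    new[doc_id] = content                                    # 1. write
    want = hashlib.sha256(content).hexdigest()
    got = hashlib.sha256(new.get(doc_id, b"")).hexdigest()   # 2. verify by reading back
    if got != want:
        return False               # leave the original untouched
    tombstones.add(doc_id)         # 3. reversible marker, not a destructive delete
    return True


old_store = {"doc-1": b"quarterly plan"}
new_store = {}
tombstones = set()
ok = migrate("doc-1", old_store, new_store, tombstones)
```

Rolling back is then a matter of clearing the tombstone, because the original bytes are still in place.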
Replication Distributed Tracing Observability (Logs, Metrics, Traces)
Slack's 5-hour outage from a cascading cache failure · 2022
Postmortem ↗
A cache misconfiguration caused a load spike that overwhelmed Slack's databases in sequence.
Slack deployed a Memcached configuration change that accidentally reduced the effective cache size. Requests that would have hit cache started hitting the database. The database absorbed the initial surge but latency crept up. Slower DB responses caused app servers to hold connections longer, exhausting their connection pools. Exhausted pools caused requests to queue. Queued requests timed out and clients retried, amplifying the load. The database load balancer fell over. Slack was effectively down for 5 hours for most users. The lesson: cache and database tiers aren't independent. When the baseline hit rate is high, a drop of just 5-10 percentage points can mean 10x database load. Monitor cache hit rate as a first-class operational metric and have a circuit breaker for cache degradation.
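The amplification is simple arithmetic: database load scales with the miss rate, so the multiplier is the ratio of miss rates after and before. A small worked example with assumed hit rates (99% before, 90% after; neither figure comes from Slack's postmortem):

```python
def db_load_multiplier(hit_rate_before: float, hit_rate_after: float) -> float:
    """Ratio of database-bound requests after vs. before a change in cache hit rate."""
    return (1.0 - hit_rate_after) / (1.0 - hit_rate_before)


# Misses go from 1% of traffic to 10% of traffic: roughly ten times the database load.
multiplier = db_load_multiplier(0.99, 0.90)
print(round(multiplier, 2))  # 10.0
```

The higher the baseline hit rate, the more violent the swing: the same nine-point drop from a 50% hit rate would raise database load by less than 20%.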
Caching Circuit Breaker Load Balancing Cache Eviction & Write Policies
Heroku's OAuth token breach: 7-week secret exposure · 2022
Postmortem ↗
An attacker stole Heroku's GitHub OAuth tokens and downloaded private repos for 7 weeks undetected.
In April 2022, GitHub notified Heroku that OAuth tokens from Heroku's GitHub integration had been used to access private repositories, including Salesforce's own internal infrastructure repos. The attacker got the tokens from an internal Heroku database. They'd been stored with insufficient encryption and the attacker had access for roughly 7 weeks before detection. Because the access pattern (a token reading repos it had authorized access to) looked like normal OAuth usage, existing monitoring never flagged it. Heroku revoked all OAuth tokens, breaking GitHub integrations for every Heroku customer. The lesson: OAuth tokens at rest are secrets. Encrypt them with envelope encryption and rotate them. Anomaly detection for authorization must use behavioral baselines, not just permission checks.
Authentication vs Authorization Encryption in Transit & at Rest
Fastly takes down the internet · 2021
Postmortem ↗
A customer config trigger crashed Fastly globally: 49 minutes, half the modern web dark.
Fastly had pushed a config update weeks earlier that introduced a latent bug, only triggered by a specific customer configuration pattern. When that customer eventually applied their config, the bug fired across Fastly's global edge fleet within 12 seconds. Reddit, the NYT, Amazon, the UK Gov website: all 503ing simultaneously. Recovery took 49 minutes because the rollback procedure itself depended on healthy edge nodes. The lesson: latent bugs triggered by customer input are essentially production bombs. Canary deployments must rotate, and your incident-response paths must work even when your data plane is on fire.
Circuit Breaker CDN (Content Delivery Network)
Facebook locks itself out of its own datacenter · 2021
Postmortem ↗
A BGP misconfig wiped Facebook from the internet for 6 hours, including their badge access.
A routine command intended to assess global backbone capacity was issued, but a bug in Facebook's audit tool failed to stop it. The command withdrew all Facebook BGP routes, taking the company off the internet. Worse: the same DNS infrastructure that announced their existence to the world also gated their internal tools, including the badge-access system at the datacenters. Engineers couldn't VPN in, couldn't open the doors, couldn't even reach the management plane to roll back. Recovery required physically driving engineers to the datacenter floor. The lesson: never let your control plane depend on your data plane. Out-of-band access has to actually be out-of-band.
DNS (Domain Name System)
DoorDash Redis cluster overload cascades to full outage · 2021
Postmortem ↗
A single Redis cluster used for rate limiting became a cascading single point of failure during peak dinner hours.
DoorDash used a central Redis cluster to store rate limiting counters. During a high-traffic event, the Redis cluster started showing elevated latency. Services calling the rate limiter were waiting on Redis responses and holding threads. Thread pools exhausted. Services started returning 503s. Upstream services receiving errors started retrying, amplifying the load. The failure cascaded horizontally: order placement, merchant dashboards, and driver assignment all went down because they all shared the same rate-limiter Redis cluster. This dependency didn't appear in any single team's architecture diagram. The lesson: shared infrastructure like rate limiters must be treated as SLO-critical with blast-radius isolation. If rate limiting fails, it should fail open, not block the entire request path.
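Failing open can be as small as a timeout plus an exception handler around the limiter call. A minimal sketch, assuming a hypothetical `check(key, timeout_s)` callable that wraps the Redis lookup and raises `TimeoutError` or `ConnectionError` when the backend is unhealthy; none of these names come from DoorDash's code.

```python
def allow_request(check, key: str, timeout_s: float = 0.05) -> bool:
    """Ask the rate limiter backend; if it is slow or unreachable, let the request through."""
    try:
        return bool(check(key, timeout_s))
    except (TimeoutError, ConnectionError):
        return True  # fail open: a broken limiter must not block the request path


def healthy_backend(key: str, timeout_s: float) -> bool:
    # Stand-in for a real counter lookup: only one client is over its limit.
    return key != "abusive-client"


def overloaded_backend(key: str, timeout_s: float) -> bool:
    raise TimeoutError("rate limiter backend did not answer in time")


print(allow_request(healthy_backend, "abusive-client"))     # False
print(allow_request(overloaded_backend, "abusive-client"))  # True
```

The trade-off is explicit: while the limiter is down, abusive traffic gets through too, which is usually cheaper than a full outage.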
Circuit Breaker Caching
Zoom's certificate expiry takes down 300M daily meeting users · 2020
Postmortem ↗
An expired TLS certificate broke Zoom for millions of users for 2 hours, and monitoring took nearly 30 minutes to notice.
In August 2020, a TLS certificate used by Zoom's authentication infrastructure expired without being renewed. Certificate expiry doesn't produce a loud failure: clients simply fail the TLS handshake and get a connection error. For users, this looked like the Zoom app failing to log in or freezing on the meeting join screen. Because the failure was a silent TLS error rather than an obvious application exception, Zoom's monitoring didn't alert for nearly 30 minutes. 300M daily users were affected. The lesson: certificate expiry is one of the most predictable outages in existence. Monitor cert expiry as an SLO metric. Alert at 30 days, page at 7, and auto-renew by default. Let's Encrypt and AWS ACM exist precisely to make manual renewal unnecessary.
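The 30-day and 7-day tiers map directly onto a small check. A sketch with a hypothetical `expiry_action` helper; how the result is routed to alerting or paging is left out.

```python
from datetime import datetime, timedelta, timezone


def expiry_action(not_after: datetime, now: datetime) -> str:
    """Map a certificate's expiry time to an action using the 30-day and 7-day tiers."""
    days_left = (not_after - now).days
    if days_left <= 7:
        return "page"    # includes certificates that have already expired
    if days_left <= 30:
        return "alert"
    return "ok"


now = datetime(2020, 8, 1, tzinfo=timezone.utc)
actions = [expiry_action(now + timedelta(days=d), now) for d in (45, 20, 3, -1)]
print(actions)  # ['ok', 'alert', 'page', 'page']
```

In practice `not_after` could be derived from the `notAfter` field returned by `ssl.SSLSocket.getpeercert()`, converted with `ssl.cert_time_to_seconds()`.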
Encryption in Transit & at Rest
Cloudflare regex CPU-bomb · 2019
Postmortem ↗
A single bad regex took down nearly all Cloudflare-fronted sites globally for 27 minutes.
Cloudflare's WAF (web application firewall) deployed a new rule containing a regex that exhibited catastrophic backtracking. On any HTTP request with the right pattern, the regex would run for seconds at 100% CPU, and it ran on every CPU core across the global fleet. Within seconds, Cloudflare's edge fleet was CPU-saturated and unable to serve traffic. Nearly all Cloudflare-fronted sites went down. Rollback took 27 minutes because the deploy mechanism itself was struggling against the saturation. Lessons: never deploy untrusted regex globally without timeouts; use a staged rollout for any rule that runs on every request; and remember that the safety mechanism is only as good as your ability to actually deploy a rollback.
Circuit Breaker Load Balancing
Google Cloud networking failure: 4 hours, 3 regions · 2019
Postmortem ↗
A config push to the backbone control plane caused packet loss across three GCP regions for four hours.
Google pushed a config update to the network control plane managing inter-region backbone routing. The config included software that consumed far more memory than expected under production conditions, causing the control plane to crash on a large fraction of routers. Each restarting router needed to re-establish BGP peering, which consumed network capacity. Restarting routers and network traffic competing for bandwidth created a feedback loop: routers trying to recover caused more congestion, which slowed recovery further. Three GCP regions (us-east1, us-central1, europe-west1) experienced 30-87% packet loss for services using the Google backbone. The lesson: stage control plane changes and validate memory/resource usage before the push. A control plane change should never be able to create a data plane feedback loop.
DNS (Domain Name System) Load Balancing Circuit Breaker
GitHub 24-hour partition · 2018
Postmortem ↗
A 43-second network partition triggered 24 hours of data inconsistency.
A 43-second network partition between GitHub's US-East and US-West data centers caused MySQL clusters in both regions to elect themselves primary (split-brain). When the partition healed, both regions had accepted writes and now had divergent state. GitHub chose consistency over availability: they ran the service in a degraded state for 24+ hours while they manually reconciled the diverged writes across clusters. The lesson: CAP isn't a textbook curiosity. When the partition heals, you've already made the C-vs-A choice. Your reconciliation strategy IS your CAP choice expressed in code.
CAP Theorem Replication Quorum
Amazon Prime Day collapses under its own launch load · 2018
Postmortem ↗
Prime Day 2018 opened with Amazon's own landing page returning errors for the first 90 minutes.
Prime Day 2018 launched with a load spike Amazon had anticipated and prepared for, but not sufficiently. The front-end tier scaled horizontally via auto-scaling groups. The recommendation service underneath did not: it depended on a Redis cluster sized for projected peak, not actual peak. The Redis cluster hit its connection limit within minutes of launch. Backend services queuing for Redis connections started timing out. The front-end returned errors. The recommendation service's circuit breaker was supposed to fail open (show a degraded UI without personalization), but configuration drift meant it was set to fail closed instead. Customers saw Amazon's dog-photo error pages for 90 minutes. The lesson: auto-scaling the frontend while leaving stateful dependencies unscaled is the most common Prime-Day-class mistake. Circuit breakers also need to be exercised in production, not just configured and forgotten.
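A breaker that fails open returns a degraded result instead of an error once the dependency has failed too often. A deliberately small sketch: `FailOpenBreaker` and the stand-in functions are hypothetical, and the half-open recovery timer a production breaker needs is omitted.

```python
class FailOpenBreaker:
    """After `max_failures` consecutive errors, skip the dependency and serve the fallback."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.max_failures:
            return fallback()        # breaker open: do not touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            return fallback()        # degrade this request instead of erroring
        self.failures = 0            # a success resets the consecutive-failure count
        return result


calls = {"n": 0}


def broken_recommendations():
    calls["n"] += 1
    raise ConnectionError("connection limit reached")


def generic_page():
    return "bestsellers"


breaker = FailOpenBreaker(max_failures=3)
pages = [breaker.call(broken_recommendations, generic_page) for _ in range(5)]
print(pages.count("bestsellers"), calls["n"])  # 5 3
```

A loop like the one at the bottom doubles as the kind of exercise the lesson calls for: it confirms the fallback path is actually wired to fail open.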
Auto-Scaling Circuit Breaker Caching
AWS S3 us-east-1 melts the internet · 2017
Postmortem ↗
One typo in a routine S3 maintenance command took down half the internet for 4 hours.
An engineer ran a debug subcommand to remove a small number of capacity servers from S3 us-east-1. A typo expanded the scope to a much larger set, including servers running the index subsystem and placement subsystem. With the index gone, every read started failing. The failure cascaded: every AWS service that depended on S3 (which was most of them: Lambda, ECS, CloudWatch, even the AWS Console) degraded. It took 4+ hours to restart the index subsystem because it hadn't been restarted at scale in years; the cold-start path itself was the bottleneck. The lesson: capacity-management commands need scope validation, and your critical recovery paths need to be exercised regularly so they don't atrophy.
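Scope validation can be a pre-flight check that refuses any removal that is too large or that would drop the fleet below a floor. A sketch under assumptions: `validate_removal`, the 5% cap, and the host naming are all hypothetical and not taken from AWS tooling.

```python
def validate_removal(targets, fleet, min_remaining: int, max_fraction: float = 0.05):
    """Return the target set if the removal is safely scoped; raise ValueError otherwise."""
    targets, fleet = set(targets), set(fleet)
    unknown = targets - fleet
    if unknown:
        raise ValueError(f"unknown hosts: {sorted(unknown)}")
    if len(targets) > max_fraction * len(fleet):
        raise ValueError(f"refusing to remove {len(targets)} of {len(fleet)} hosts at once")
    if len(fleet) - len(targets) < min_remaining:
        raise ValueError("removal would drop the fleet below minimum capacity")
    return targets


fleet = {f"index-{i:03d}" for i in range(100)}
approved = validate_removal(["index-001", "index-002"], fleet, min_remaining=90)

# A mistyped selector that matches far more than intended is rejected.
too_broad = [host for host in fleet if host.startswith("index-0")]
try:
    validate_removal(too_broad, fleet, min_remaining=90)
    rejected = False
except ValueError:
    rejected = True
```

The check runs before anything is touched, so a rejected command costs nothing beyond rerunning it with the intended scope.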
Circuit Breaker Replication Sharding / Data Partitioning
GitLab database deletion · 2017
Postmortem ↗
An engineer ran rm -rf on the wrong database during a 1am incident response.
At 1am, GitLab's secondary database was lagging badly. An engineer trying to fix replication accidentally ran `rm -rf` on the PRIMARY database directory instead of the secondary. Production data, gone. They had FIVE backup mechanisms in place: snapshots, dumps, replicas, etc. Four of them were silently broken or empty. The fifth had a 6-hour-old backup. They lost 6 hours of data: roughly 5,000 projects, 5,000 comments, and 700 new user accounts. The lesson: backups that aren't tested aren't backups. Restore drills are not optional. They're the only thing that proves your backup strategy works.
Replication
Stripe double-charges thousands of customers · 2016
Postmortem ↗
A race condition in charge creation caused duplicate charges when clients retried on a slow response.
Stripe's charge API occasionally returned a timeout to clients. The HTTP connection dropped before the response arrived, even though the charge had already been created on Stripe's side. Well-behaved clients, following Stripe's own retry guidance, retried the request. Without idempotency keys, Stripe's backend treated the retry as a new charge and created a second one. Thousands of customers were double-billed before the incident was caught. Stripe rolled out idempotency key enforcement as a first-class API primitive: clients send a unique key per intended charge, and the backend deduplicates on that key no matter how many times the request arrives. The lesson: any operation that charges money, sends a message, or has real-world side effects must be idempotent end-to-end. Timeouts aren't errors; they're ambiguous. Design your API for that ambiguity.
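Deduplicating on a client-supplied key is the core of the fix. A minimal in-memory sketch: `ChargeService` is a hypothetical stand-in, not Stripe's API, and a real backend would need the key lookup and charge creation to be atomic in durable storage.

```python
import uuid


class ChargeService:
    """Creates at most one charge per idempotency key, however often the request arrives."""

    def __init__(self):
        self._by_key = {}
        self.charges_created = 0

    def create_charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]   # retry: replay the original result
        self.charges_created += 1
        charge = {"id": f"ch_{self.charges_created}", "amount_cents": amount_cents}
        self._by_key[idempotency_key] = charge
        return charge


service = ChargeService()
key = str(uuid.uuid4())                    # one key per intended charge, reused on retry
first = service.create_charge(key, 1999)
retry = service.create_charge(key, 1999)   # client timed out and retried
print(first == retry, service.charges_created)  # True 1
```

The client generates the key before the first attempt, which is what lets a retry after an ambiguous timeout be recognized as the same intent.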
Idempotency
Knight Capital: $440M in 45 minutes · 2012
Postmortem ↗
A stale feature flag on one server bankrupted a 17-year-old trading firm in 45 minutes.
Knight Capital rolled out new trading code to its 8 servers, but the deploy missed one. That one server still had old code that, combined with a re-used feature flag from a long-retired test feature, started buying high and selling low on every trade it received. In 45 minutes the firm lost $440M, more than the company's entire net assets. The lesson: deploy automation that fails closed when a host doesn't ack. Feature flags should be deleted when their feature is retired, not left as time bombs. Anything that touches real money needs invariant checks the code can't bypass ('we should never buy 200% above market').
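The quoted invariant can be written as a hard pre-trade check. A sketch only: `check_buy_price` and its default are hypothetical, and to be something the trading code can't bypass it would have to run in a separate layer that every order passes through, which is an assumption about architecture rather than a description of Knight's systems.

```python
def check_buy_price(limit_price: float, market_price: float, max_premium: float = 2.0) -> None:
    """Reject a buy order priced more than `max_premium` (2.0 = 200%) above market."""
    if market_price <= 0:
        raise ValueError("no valid market price; refusing to trade")
    ceiling = market_price * (1.0 + max_premium)
    if limit_price > ceiling:
        raise ValueError(f"buy at {limit_price} exceeds ceiling {ceiling}")


check_buy_price(29.0, 10.0)       # within bounds: returns without raising
try:
    check_buy_price(31.0, 10.0)   # more than 200% above a market price of 10
    blocked = False
except ValueError:
    blocked = True
```

Raising rather than logging is the point: the order stops at the check instead of relying on someone to notice a warning.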
Circuit Breaker
Netflix's Christmas Eve outage that built Chaos Engineering · 2012
Postmortem ↗
An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve and pushed the company to build Chaos Kong.
On Christmas Eve 2012, Amazon suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident directly caused Netflix to accelerate their multi-region active-active migration and to build Chaos Kong, the tool that kills an entire AWS region in production on a regular schedule, to ensure the failover path never goes stale. The lesson: a failover plan that's never been tested is a guess. You only trust what you've actually run.
Chaos Engineering Failover & Redundancy Circuit Breaker
