A deep look at Netflix's Active-Active multi-region architecture and what it actually took to get there.
Netflix serves over 700 million streaming hours per day from data centers across three AWS regions. If any one region goes dark (hardware failure, network partition, deployment gone wrong), the other two absorb the traffic automatically, with no manual intervention and no visible outage for users. Getting there took five years. This is what they built.
Netflix originally ran an active-passive setup: one primary region handles all traffic while a standby region replicates behind it. On a failover event, traffic is routed to the standby. Sounds clean. In practice it has two fatal flaws. First, the standby region is cold: no warm caches, no pre-scaled capacity. You fail over right as traffic spikes, and the resulting thundering herd of requests can topple the standby too. Second, failovers are rare enough that the procedure atrophies. The runbooks go stale, the engineers who wrote them move teams, and the first time you actually need the failover it fails in a new way. Netflix saw this firsthand during a 2012 Christmas Eve outage that knocked them offline for hours.
The insight: the only reliable failover is no failover. If every region handles live traffic every day, then losing a region just means the other two handle more of the same traffic they are already handling. Caches stay warm. Capacity is already scaled. The failure path is exercised continuously, not in a drill. Netflix routes roughly equal traffic fractions to each region using weighted DNS (Route 53). When a region fails, its DNS weight is set to zero and the other regions absorb the load as clients re-resolve, faster than any manual runbook.
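To make the weight shift concrete, here is a minimal sketch in plain Python. It does not call Route 53 or any Netflix tooling; the region names, the even three-way split, and the function names `shift_weights` and `load_increase` are assumptions chosen for illustration.

```python
def shift_weights(weights, failed_region):
    """Return new routing weights with failed_region set to zero.

    The failed region's share is spread across the surviving regions in
    proportion to the weight each one already carries.
    """
    survivors = {
        region: w
        for region, w in weights.items()
        if region != failed_region and w > 0
    }
    total = sum(survivors.values())
    if total == 0:
        raise ValueError("no healthy region left to absorb traffic")
    new_weights = {region: 0.0 for region in weights}
    for region, w in survivors.items():
        new_weights[region] = w / total
    return new_weights


def load_increase(old_weights, new_weights, region):
    """Fractional increase in traffic that a surviving region must absorb."""
    return new_weights[region] / old_weights[region] - 1.0


# Hypothetical region names and an even three-way split.
weights = {"us-east-1": 1 / 3, "us-west-2": 1 / 3, "eu-west-1": 1 / 3}
after = shift_weights(weights, "us-east-1")
extra = load_increase(weights, after, "us-west-2")
```

With three equally weighted regions, each survivor goes from one third of the traffic to one half, so it has to carry about 50% more load than before. That is the headroom each region needs to keep available for "capacity is already scaled" to hold.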
Active-Active sounds simple until you ask: what happens to a user's viewing state, their account preferences, their payment status, when writes are flying into multiple regions simultaneously? Netflix made a deliberate CAP theorem call: they chose availability over strict consistency. Most Netflix data (play history, viewing position, recommendations) is eventually consistent across regions. If you pause a show in Virginia and resume in Oregon, your saved position might briefly lag by a couple of seconds until replication catches up. Nobody notices. For the handful of operations that truly require consistency (billing, account creation), Netflix routes those writes to a single region and accepts higher latency.
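The split between the two kinds of writes can be sketched as follows. This is an illustration under stated assumptions, not Netflix's implementation: the operation names, the `HOME_REGION` value, and the last-write-wins merge rule are all hypothetical choices made to show the shape of the trade-off.

```python
from dataclasses import dataclass

# Assumption: these operations are pinned to one "home" region.
STRONG_CONSISTENCY_OPS = {"billing", "account_creation"}
HOME_REGION = "us-east-1"  # hypothetical choice of home region


def write_region(operation, local_region):
    """Pick the region that should accept a write."""
    if operation in STRONG_CONSISTENCY_OPS:
        # Pay cross-region latency in exchange for a single writer.
        return HOME_REGION
    # Everything else is written locally and replicated asynchronously.
    return local_region


@dataclass(frozen=True)
class Position:
    seconds: int       # playback offset into the title
    updated_at: float  # write timestamp (assumes roughly synced clocks)


def merge(a, b):
    """Last-write-wins: keep whichever replica saw the newer update."""
    return a if a.updated_at >= b.updated_at else b


# A viewer pauses in one region, then a newer update lands in another.
virginia = Position(seconds=100, updated_at=1.0)
oregon = Position(seconds=102, updated_at=2.0)
resolved = merge(virginia, oregon)
```

Last-write-wins is the simplest possible conflict rule and it silently drops the older write. That is acceptable for a playback position and unacceptable for a payment, which is why the second category is routed to a single region instead of being merged.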
Netflix built Chaos Monkey in 2010 and Chaos Kong in 2012 specifically because they knew that systems you don't test fail in production. Chaos Monkey randomly kills individual EC2 instances during business hours. Chaos Kong kills an entire AWS region. By running region failovers daily in lower environments and regularly in production, Netflix ensures the failure path is never stale. When an actual AWS region event happens, it looks to their systems like a well-rehearsed drill, because it is.
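The core loop of an instance-killing tool is small enough to sketch. The version below is not Chaos Monkey's code: the 09:00 to 17:00 weekday window, the function names, and the `terminate` callback (a stand-in for whatever cloud API call actually stops an instance) are assumptions for illustration.

```python
import random
from datetime import datetime


def in_business_hours(now):
    """Weekdays, 09:00 to 17:00 local time (an assumed window)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances, now, rng):
    """Return one instance ID to terminate, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


def run_once(instances, terminate, now=None, rng=None):
    """Terminate at most one randomly chosen instance via the callback."""
    if now is None:
        now = datetime.now()
    if rng is None:
        rng = random.Random()
    victim = pick_victim(instances, now, rng)
    if victim is not None:
        terminate(victim)
    return victim


# Dry run: record the victim instead of calling a real cloud API.
killed = []
victim = run_once(
    ["i-aaa", "i-bbb", "i-ccc"],  # hypothetical instance IDs
    killed.append,
    now=datetime(2024, 1, 3, 10, 0),  # a Wednesday morning
    rng=random.Random(0),
)
```

Restricting terminations to business hours is the point of the design: failures are injected when engineers are at their desks to observe and fix the fallout, not at 3 a.m.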
You probably don't have Netflix's scale or budget. But the principles transfer. Any system whose disaster recovery procedure is only exercised in drills will fail differently in real events. The best way to make failover reliable is to make it continuous. Start smaller: run your secondary region under real traffic today, even 5%. Ensure your caches are warmed. Write down your consistency trade-offs explicitly. A failover path that is never exercised is one nobody can trust to work when it matters.