Failover & Redundancy

Keep spare capacity standing by and switch to it automatically when something dies, so a component failure isn't an outage.

Redundancy means keeping a spare ready, not just hoping nothing breaks. The standby mirrors the primary, so when the primary dies, traffic fails over to the standby automatically. The cost is running (and paying for) hardware that mostly sits idle.

Plain English: redundancy means having backups (extra servers, replicas, data centers). Failover is automatically switching to a backup when the primary dies. Together they're how you turn 'a server crashed' into a non-event instead of a 3am outage.

What it is

Redundancy is having more capacity or copies than the minimum needed, so the loss of one doesn't cause failure: redundant servers, replicated data, multiple availability zones. Failover is the act of detecting a failure and shifting traffic or responsibility to a healthy standby. Together they're the backbone of high availability.

The problem it solves

Everything fails eventually: disks, machines, racks, whole data centers. Without redundancy, every component is a single point of failure and every failure is an outage. Redundancy removes the single points; failover makes recovery automatic and fast instead of a manual scramble. The combination is what lets a service advertise four or five nines of availability.
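
To put numbers on "four or five nines", here is a minimal arithmetic sketch. The function names are illustrative, and the redundancy formula assumes replica failures are independent and failover is instant, both of which are optimistic in practice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year for an availability fraction such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one can serve.

    Assumes independent failures and perfect, instant failover.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for label, a in [("four nines", 0.9999), ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} min/year")
    print(f"two 99% replicas: {combined_availability(0.99, 2):.4%}")
```

Four nines leaves roughly 52.6 minutes of downtime per year and five nines roughly 5.3. Under the independence assumption, two replicas that are each 99% available combine to about 99.99%, which is why shared dependencies that break that independence matter so much.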

How it works

Redundancy comes in modes: active-active (all replicas serve traffic; losing one just reduces capacity) and active-passive (a standby waits, promoted on primary failure). Failover detects death via health checks/heartbeats, then redirects: a load balancer drops the dead instance from rotation, a follower DB is promoted to primary, or DNS/anycast shifts a region. Geographic redundancy spreads across availability zones and regions so a whole-datacenter loss is survivable. Quorum and leader election keep failover from causing split-brain.
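
The active-passive path described above can be sketched as a heartbeat timeout plus a role swap. This is a toy model, not a production design: `Node`, `FailoverController`, and the 3-second timeout are all hypothetical names and values, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (seconds) the last heartbeat was received


@dataclass
class FailoverController:
    primary: Node
    standby: Node
    timeout: float = 3.0  # seconds of silence before a node is considered dead

    def is_alive(self, node: Node, now: float) -> bool:
        return (now - node.last_heartbeat) <= self.timeout

    def tick(self, now: float) -> str:
        """Run one health-check pass and return the node that should get traffic.

        Promotes the standby only if the primary has timed out and the
        standby itself is still sending heartbeats.
        """
        if not self.is_alive(self.primary, now) and self.is_alive(self.standby, now):
            # Swap roles: the standby becomes the new primary.
            self.primary, self.standby = self.standby, self.primary
        return self.primary.name


if __name__ == "__main__":
    ctl = FailoverController(Node("a", 0.0), Node("b", 0.0))
    print(ctl.tick(now=1.0))           # both nodes healthy
    ctl.standby.last_heartbeat = 4.0   # "b" keeps beating, "a" has gone silent
    print(ctl.tick(now=5.0))           # "a" exceeded the timeout
```

A single controller like this is itself a single point of failure and cannot tell a dead primary from a network partition, which is why real systems put the promotion decision behind quorum or leader election.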

Why use it

  • Turns component failures into non-events, since single points of failure are eliminated
  • Active-active also adds capacity and load distribution, not just safety
  • Geographic redundancy survives entire-datacenter and regional outages

What it costs you

  • You pay for idle capacity: active-passive standbys cost money to sit and wait
  • Failover itself is risky: untested failover paths fail when you finally need them, and promotion can cause split-brain or data loss
  • True multi-region redundancy forces hard data-consistency and replication-lag tradeoffs

Where it shows up in our architectures

  • WhatsApp

    Cassandra with replication factor 3 keeps three copies of each row on different nodes, so losing one replica causes no downtime: reads and writes continue against the two survivors, which still form a quorum

  • Payment Gateway

    The Postgres ledger runs primary + standby with automatic promotion; money writes must survive a primary failure without loss

  • Amazon S3 (Object Storage)

    Erasure coding across multiple availability zones means object durability survives whole-AZ failure, not just disk failure

Gotchas

  • Untested failover is broken failover. The standby you've never failed over to will surprise you in the worst way, so run game days and chaos drills regularly.
  • Active-passive wastes the standby's capacity, and the standby may be cold (stale caches, unwarmed connections) when promoted. Active-active avoids both but needs the app to tolerate concurrent writes.
  • Automatic failover can cause split-brain or data loss if it promotes a lagging replica. Pair promotion with quorum/fencing so two primaries can't coexist.
  • Redundancy at one layer doesn't help if a shared dependency fails: redundant app servers all talking to one database still have a single point of failure.
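
The quorum/fencing pairing from the split-brain gotcha can be sketched as follows. This is an illustration under assumptions, not any particular database's mechanism: `FencedStore`, `promote`, and the epoch counter are hypothetical names. A new primary gets a higher epoch only if a majority votes for it, and the shared store refuses writes that carry an older epoch.

```python
from typing import Optional


class FencedStore:
    """Shared resource that rejects writes from a superseded primary."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: dict = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale primary: fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True


def has_quorum(votes: int, cluster_size: int) -> bool:
    """True when strictly more than half of the cluster agrees."""
    return votes > cluster_size // 2


def promote(current_epoch: int, votes: int, cluster_size: int) -> Optional[int]:
    """Return the new primary's epoch, or None if there is no majority."""
    if not has_quorum(votes, cluster_size):
        return None
    return current_epoch + 1


if __name__ == "__main__":
    store = FencedStore()
    store.write(1, "balance", "100")             # old primary, epoch 1
    new_epoch = promote(1, votes=2, cluster_size=3)
    store.write(new_epoch, "balance", "90")      # new primary, epoch 2
    print(store.write(1, "balance", "100"))      # old primary wakes up late
```

Because the old primary still holds epoch 1, its late write is rejected once the new primary has written with epoch 2, so two primaries cannot both modify the data.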

When this went wrong in production

Netflix's Christmas Eve outage that pushed chaos testing to region scale · 2012

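An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve and pushed the company to go multi-region and to rehearse whole-region failover with Chaos Kong. Chaos Monkey, which kills individual instances, already existed by then.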

On Christmas Eve 2012, AWS suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident pushed Netflix to accelerate its multi-region active-active migration and to build Chaos Kong, an exercise that simulates the loss of an entire AWS region in production by shifting its traffic elsewhere, run regularly so the failover path never goes stale. The lesson: a failover plan that's never been tested is a guess. You only trust what you've actually run.

Interview angle

Failover and redundancy questions test whether you design for failure as a first-class requirement. The key thing to say is that redundancy is meaningless if you never test failover, and most teams discover their failover is broken during an actual outage. Proactively mention that you'd run chaos engineering or game days, and flag that active-passive standbys can be cold when promoted. Candidates lose points by saying 'just add a replica' without addressing how failover is triggered, tested, and protected against split-brain.
