Reliability·3 min read

Failover & Redundancy

Keep spare capacity standing by and switch to it automatically when something dies, so a component failure isn't an outage.


Plain English: redundancy means having backups (extra servers, replicas, data centers). Failover means automatically switching to a backup when the primary dies. Together they turn 'a server crashed' into a non-event instead of a 3 a.m. outage.

Used in: WhatsApp, Payment Gateway, Amazon S3 (Object Storage)
What it is

Redundancy is having more capacity or copies than the minimum needed, so the loss of one doesn't cause failure: redundant servers, replicated data, multiple availability zones. Failover is the act of detecting a failure and shifting traffic or responsibility to a healthy standby. Together they're the backbone of high availability.

The problem it solves

Everything fails eventually: disks, machines, racks, whole data centers. Without redundancy, every component is a single point of failure and every failure is an outage. Redundancy removes the single points; failover makes recovery automatic and fast instead of a manual scramble. The combination is what lets a service advertise four or five nines of availability (99.99% or 99.999% uptime, which allows roughly 53 or 5 minutes of downtime per year).
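To make the availability arithmetic concrete, here is a minimal sketch. It assumes replica failures are independent and that failover is instant and never fails itself, which real systems only approximate. The function names are illustrative, not taken from any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies when any one is enough.

    Assumption: failures are independent, so the system is down only
    when every replica is down at the same time.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Compare one, two, and three replicas that are each 99% available.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): availability {a:.6f}, "
              f"about {downtime_minutes_per_year(a):.1f} min/year down")
```

Under these assumptions, two replicas that are each 99% available combine to about 99.99%. In practice correlated failures (shared power, shared deploys, shared bugs) make the real figure lower, which is one reason replicas are spread across zones and regions.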

How it works

Redundancy comes in two main modes: active-active (all replicas serve traffic, so losing one only reduces capacity) and active-passive (a standby waits and is promoted when the primary fails). Failover detects a dead component via health checks or heartbeats, then redirects work: a load balancer drops the dead instance from rotation, a follower database is promoted to primary, or DNS or anycast shifts traffic to another region. Geographic redundancy spreads replicas across availability zones and regions so that losing a whole data center is survivable. Quorum and leader election keep failover from causing split-brain, where two nodes each believe they are the primary and accept conflicting writes.
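The detection and promotion steps above can be sketched in a few lines. This is a toy, single-process model rather than a real election protocol: `Node`, `Cluster`, `heartbeat_timeout`, and `check_and_failover` are hypothetical names, and picking the alphabetically first healthy node stands in for a proper leader election such as Raft.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time the last heartbeat was seen, in seconds


@dataclass
class Cluster:
    nodes: Dict[str, Node]
    primary: str
    heartbeat_timeout: float = 3.0  # assumed threshold; tune per system

    def healthy(self, now: float) -> List[str]:
        """Names of nodes whose heartbeat is recent enough, sorted."""
        return sorted(
            n.name for n in self.nodes.values()
            if now - n.last_heartbeat <= self.heartbeat_timeout
        )

    def check_and_failover(self, now: float) -> Optional[str]:
        """Return the primary after a health check, promoting if needed.

        Returns None when fewer than a majority of nodes are healthy:
        without quorum this side refuses to have a primary, which is
        what prevents split-brain.
        """
        healthy = self.healthy(now)
        if len(healthy) <= len(self.nodes) // 2:
            return None
        if self.primary not in healthy:
            # Deterministic stand-in for a real leader election.
            self.primary = healthy[0]
        return self.primary


if __name__ == "__main__":
    cluster = Cluster(
        nodes={n: Node(n, last_heartbeat=100.0) for n in ("a", "b", "c")},
        primary="a",
    )
    # "b" and "c" keep sending heartbeats; "a" goes silent.
    cluster.nodes["b"].last_heartbeat = 105.0
    cluster.nodes["c"].last_heartbeat = 105.0
    print(cluster.check_and_failover(now=106.0))
```

The quorum check is the design choice worth noting: a node that can only see a minority of the cluster cannot tell whether its peers are dead or merely unreachable, so it declines to act as primary instead of risking two primaries.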

