Keep spare capacity standing by and switch to it automatically when something dies, so a component failure isn't an outage.
Kill the primary and watch the standby take over.
Traffic flows to the primary. The standby stays in sync, ready.
Redundancy means keeping a spare ready, not just hoping nothing breaks. The standby mirrors the primary, so when the primary dies, traffic fails over to the standby automatically. The cost is running (and paying for) hardware that mostly sits idle.
Plain English: redundancy means having backups (extra servers, replicas, data centers). Failover is automatically switching to a backup when the primary dies. Together they're how you turn 'a server crashed' into a non-event instead of a 3am outage.
Redundancy is having more capacity or copies than the minimum needed, so the loss of one doesn't cause failure: redundant servers, replicated data, multiple availability zones. Failover is the act of detecting a failure and shifting traffic or responsibility to a healthy standby. Together they're the backbone of high availability.
Everything fails eventually: disks, machines, racks, whole data centers. Without redundancy, every component is a single point of failure and every failure is an outage. Redundancy removes the single points; failover makes recovery automatic and fast instead of a manual scramble. The combination is what lets a service advertise four or five nines of availability.
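To put numbers on "four or five nines", the sketch below converts an availability target into allowed downtime per year and estimates the availability of several redundant copies. It assumes replica failures are independent and that failover is instantaneous and always works, which real systems only approximate; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` replicas when any one surviving copy is enough.

    Assumes independent failures and perfect failover (an idealization).
    """
    return 1.0 - (1.0 - single) ** copies


for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.2f} minutes/year")

# Two replicas that are each 99% available, under the independence assumption.
print(f"two 99% replicas: {redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, four nines leaves roughly 52.6 minutes of downtime a year, and two independent 99% replicas already reach about 99.99%. Correlated failures (a shared rack, region, or bad deploy) make the real figure lower, which is why the copies are spread out.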
Redundancy comes in two common modes: active-active (all replicas serve traffic; losing one just reduces capacity) and active-passive (a standby waits and is promoted when the primary fails). Failover detects a dead node via health checks or heartbeats, then redirects traffic: a load balancer drops the dead instance from rotation, a follower DB is promoted to primary, or DNS or anycast shifts traffic to another region. Geographic redundancy spreads replicas across availability zones and regions so that losing a whole data center is survivable. Quorum and leader election keep failover from causing split-brain, where two nodes each believe they are the primary and accept conflicting writes.
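The detect-then-redirect loop for an active-passive pair can be sketched as follows. This is a minimal illustration, not any real product's API: `Node`, `FailoverController`, and the 3-second timeout are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (seconds) the most recent heartbeat arrived


@dataclass
class FailoverController:
    """Hypothetical controller that watches heartbeats for one primary/standby pair."""
    primary: Node
    standby: Node
    timeout: float = 3.0  # seconds of silence before a node is presumed dead

    def is_alive(self, node: Node, now: float) -> bool:
        return (now - node.last_heartbeat) <= self.timeout

    def check(self, now: float) -> str:
        """Promote the standby if the primary is silent; return the active node's name."""
        primary_dead = not self.is_alive(self.primary, now)
        if primary_dead and self.is_alive(self.standby, now):
            # Swap roles: the old standby now receives traffic.
            self.primary, self.standby = self.standby, self.primary
        return self.primary.name


ctl = FailoverController(primary=Node("db-a", 0.0), standby=Node("db-b", 0.0))
first = ctl.check(now=2.0)         # both nodes were heard from within the timeout
ctl.standby.last_heartbeat = 9.0   # db-b keeps reporting in; db-a has gone silent
second = ctl.check(now=10.0)       # db-a has been silent for 10s, so db-b is promoted
print(first, second)
```

A single observer like this cannot tell a dead primary from one it merely cannot reach. That gap is what quorum and leader election close, by requiring a majority to agree before anyone is promoted.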
With Cassandra replication factor 3, losing one replica causes no downtime: reads and writes at quorum consistency continue against the two survivors.
The Postgres ledger runs a primary and a standby with automatic promotion; money writes must survive a primary failure without loss.
Erasure coding across multiple availability zones means object durability survives whole-AZ failure, not just disk failure.
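The replication factor 3 example above comes down to majority arithmetic, sketched here. The helper names are illustrative and not part of any database client; the overlap rule is the standard R + W > N condition for a read to see the latest acknowledged write.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(f"quorum for RF={rf}: {quorum(rf)}")
print(f"replicas that can fail: {tolerated_failures(rf)}")
print(f"quorum reads + quorum writes overlap: {reads_see_latest_write(quorum(rf), quorum(rf), rf)}")
```

With three replicas the majority is two, so one replica can be lost while quorum reads and writes still succeed. The same arithmetic blocks split-brain: two disjoint groups cannot both hold a majority.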
An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve 2012, pushing Netflix to go multi-region and build Chaos Kong.
On Christmas Eve 2012, Amazon suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident directly caused Netflix to accelerate their multi-region active-active migration and to build Chaos Kong, the tool that kills an entire AWS region in production on a regular schedule, to ensure the failover path never goes stale. The lesson: a failover plan that's never been tested is a guess. You only trust what you've actually run.
Failover and redundancy questions test whether you design for failure as a first-class requirement. The key thing to say is that redundancy is meaningless if you never test failover, and most teams discover their failover is broken during an actual outage. Proactively mention that you'd run chaos engineering or game days, and flag that an active-passive standby can be cold when promoted, with empty caches and data that lags the primary. Candidates lose points by saying 'just add a replica' without addressing how failover is triggered, tested, and protected against split-brain.