Reqflow
← All concepts
Distributed Systems·3 min read

Leader Election

Pick exactly one node to coordinate, and re-pick safely when it dies, without ever ending up with two leaders.

Try it

Kill the leader. The remaining nodes elect a new one.

👑
N1
leader
N2
follower
N3
follower
N4
follower

N1 is the leader for term 1.

Many systems need exactly one node in charge (to coordinate writes or assign work). When that leader dies, the survivors run an election and pick a new one in a higher term, so the cluster keeps a single coordinator without anyone stepping in. Algorithms like Raft formalize this.

First time reading this? Start here

Plain English: some jobs need exactly one node in charge (one writer, one scheduler). Leader election is how a cluster agrees on who that is, and how it picks a new boss when the current one dies, while guaranteeing you never accidentally get two bosses (split-brain), which corrupts data.

What it is

A coordination mechanism by which a set of nodes agree on a single 'leader' responsible for some exclusive role: accepting writes, assigning work, or coordinating others. It's a core building block implemented by consensus protocols (Raft, Paxos, ZAB) and coordination services (ZooKeeper, etcd).

The problem it solves

Many tasks must be done by exactly one node: a single write primary, a single job scheduler, a single sequence generator. Hard-coding the leader makes it a single point of failure. Leader election lets the cluster choose a leader dynamically and, critically, elect a new one when the leader fails, without two nodes both believing they're in charge (split-brain), which causes divergent writes and data corruption.

How it works

Nodes detect leader failure via heartbeats. When the leader's heartbeats stop, candidates start an election. Consensus protocols require a candidate to win votes from a majority quorum (N/2+1) before becoming leader, and this is what prevents two leaders, since two different majorities can't exist simultaneously. The new leader operates within a bounded 'term'/'epoch'; stale leaders that come back are fenced off by the higher epoch number. Coordination services expose this as ephemeral nodes or leases: hold the lease (renewed via heartbeat) and you're leader; lose it and someone else takes over.

Why use it

  • Removes the single-point-of-failure of a hard-coded coordinator, since failover is automatic
  • Quorum-based election prevents split-brain by construction (two majorities can't coexist)
  • Available off-the-shelf via Raft/etcd/ZooKeeper, so you rarely need to build it yourself

What it costs you

  • Election takes time: there's a leaderless window (seconds) during failover where writes stall
  • Needs a majority alive; lose quorum (network partition isolating the majority) and the cluster can't elect anyone and halts
  • Subtle to get right: naive timeout-based election without fencing tokens reintroduces split-brain

Where it shows up in our architectures

  • Apache Kafka

    Each partition has an elected leader replica that handles all reads/writes; on broker failure a new leader is elected from the in-sync replicas

  • Distributed Cache

    ZooKeeper/etcd elect a coordinator for ring membership and config; cache nodes follow the elected leader's view

  • Stock Exchange (Matching Engine)

    The single active matching engine per symbol is the leader; a hot standby is promoted via election on failure to keep one authoritative order book

Gotchas

  • Never roll your own leader election with bare timeouts; without quorum and fencing tokens you'll get split-brain under partition. Use Raft/etcd/ZooKeeper.
  • Election isn't instant. Plan for a failover window where writes pause; tune heartbeat/timeout to balance fast failover against false elections from network blips.
  • Fencing tokens (monotonic epoch numbers) are essential: a 'dead' leader that was only partitioned can come back and try to act. The higher epoch must win or it corrupts state.
  • Losing quorum stops the world. A 3-node cluster split 1-2 keeps working (the 2-side); split into three isolated nodes can't elect anyone. Size and place nodes to keep a majority reachable.
Interview angle

Leader election comes up whenever you have a single-writer system or a distributed scheduler. The key insight to convey is that you can't do leader election with just timeouts because a partitioned node can still think it's the leader. You must use a quorum-based protocol (Raft) or a coordination service (etcd, ZooKeeper). Mention fencing tokens to show you know how to handle a 'zombie leader' that comes back from a partition. Candidates who say 'just use a heartbeat and promote on timeout' will face immediate follow-up questions about split-brain they can't answer.

Your notes

Private to you