Heartbeat

Periodic 'I'm alive' signals between nodes so failures are detected within seconds, not minutes.

Each node periodically pings a monitor to say "still alive." Miss enough beats in a row and the monitor assumes the node is gone and reacts (reroutes traffic, triggers failover). The tuning tension: too sensitive and a slow network looks like death; too lax and real failures take ages to notice.

First time reading this? Start here

Plain English: every few seconds, each server pings its peers to say 'I'm still here.' If the pings stop, the others assume that server died and stop sending it work. Without heartbeats, a failure can go unnoticed for minutes.

Used in: WhatsApp, Distributed Cache

What it is

A small periodic message sent from one node to another (or to a coordinator) signalling that the sender is alive and healthy. Absence of heartbeats for some threshold triggers failover, removal from a load-balancer pool, or alerts.

The problem it solves

Distributed systems need to know when a peer has failed. TCP connections can stay 'open' long after the other side has crashed, so without explicit heartbeats, you'd discover failures only when the next request times out, potentially minutes later.

How it works

Each node sends a heartbeat every T seconds (typically 1-5 s). The receiver records when it last heard from each sender. If no heartbeat arrives within K × T (typically K = 3, so 3 s when T = 1 s), the receiver declares the sender dead and acts on it: leader election kicks off, the dead node is removed from rotation, and alerts fire.
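
The receiver side of this rule can be sketched in a few lines of Python. This is a minimal illustration, not a production failure detector: the class name `HeartbeatMonitor`, its method names, and the injectable `clock` parameter are all hypothetical, and the sketch leaves out networking and the failover action itself.

```python
import time


class HeartbeatMonitor:
    """Tracks last-seen timestamps and flags nodes silent for longer than K * T."""

    def __init__(self, interval_s=1.0, missed_beats=3, clock=time.monotonic):
        self.interval_s = interval_s      # T: expected gap between heartbeats
        self.missed_beats = missed_beats  # K: beats that may be missed before death
        self.clock = clock                # injectable so tests can control time
        self.last_seen = {}               # node id -> time of most recent heartbeat

    def record_heartbeat(self, node_id):
        """Called whenever a heartbeat from node_id arrives."""
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self):
        """Return the nodes whose last heartbeat is older than K * T."""
        deadline = self.missed_beats * self.interval_s
        now = self.clock()
        return sorted(n for n, seen in self.last_seen.items() if now - seen > deadline)
```

The default clock is `time.monotonic` rather than wall-clock time, so a system clock adjustment cannot make a healthy node look silent. A caller would poll `dead_nodes()` on a timer and trigger whatever reaction the system needs.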

Why use it

Fast failure detection (seconds, not minutes)
Simple to implement and reason about
Used by most cluster managers and coordination services (Kubernetes, Consul, ZooKeeper)

What it costs you

False positives: network blips cause healthy nodes to be marked dead
Heartbeat traffic adds up in large clusters (gossip helps but adds complexity)
Tuning the threshold is a real tradeoff: too short = flapping, too long = slow detection
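
To put rough numbers on these costs, the following back-of-the-envelope helpers compute the detection window (K × T) and the heartbeat message rate for two topologies. The function names are hypothetical, and the figures ignore retries, message size, and network jitter.

```python
def detection_window_s(interval_s, missed_beats):
    """Longest silence tolerated before a node is declared dead: K * T."""
    return missed_beats * interval_s


def all_to_all_messages_per_s(nodes, interval_s):
    """Heartbeats per second if every node pings every other node directly."""
    return nodes * (nodes - 1) / interval_s


def central_monitor_messages_per_s(nodes, interval_s):
    """Heartbeats per second if every node pings a single monitor."""
    return nodes / interval_s


print(detection_window_s(1.0, 3))                 # 3.0
print(all_to_all_messages_per_s(1000, 1.0))       # 999000.0
print(central_monitor_messages_per_s(1000, 1.0))  # 1000.0
```

At 1,000 nodes and a 1-second interval, direct all-to-all heartbeats cost roughly a million messages per second, which is why large clusters move to a central monitor or gossip.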

Where it shows up in our architectures

WhatsApp: implicit in the WebSocket layer, where ping/pong frames detect dropped connections
Distributed Cache: ZooKeeper monitors cache node liveness via session heartbeats

Gotchas

Configure thresholds based on real network behavior. 1-second heartbeats with a three-strikes rule work for healthy LANs; cellular networks need much more slack.
Heartbeat-based failover can be wrong: the 'dead' node might just be partitioned. Pair with quorum to avoid split-brain.
Gossip protocols (as in Cassandra) scale heartbeats to thousands of nodes; direct all-to-all heartbeats cannot, because their message count grows as N × N.
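
The quorum advice above can be illustrated with a simplified sketch: a node counts as dead only when a strict majority of monitors have independently stopped hearing from it. The function `confirmed_dead` and its input shape are assumptions made for illustration; real systems typically run this decision through a consensus service rather than a local vote count.

```python
def confirmed_dead(suspicions, total_monitors):
    """Return nodes that a strict majority of monitors independently suspect.

    suspicions: dict mapping monitor id -> set of node ids it has stopped hearing from.
    total_monitors: size of the full monitor group, including any that reported nothing.
    """
    quorum = total_monitors // 2 + 1
    votes = {}
    for suspected in suspicions.values():
        for node in suspected:
            votes[node] = votes.get(node, 0) + 1
    return sorted(node for node, count in votes.items() if count >= quorum)


reports = {"m1": {"a", "b"}, "m2": {"a"}, "m3": set()}
print(confirmed_dead(reports, total_monitors=3))  # ['a']
```

Here node `b` is suspected by only one of three monitors, which is more likely a partition between `m1` and `b` than a real crash, so it stays in rotation.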

Interview angle

Heartbeat is a low-key concept, but it comes up whenever you're designing a system with leader election, worker pools, or health-checked service discovery. The signal to give is that TCP keepalive is not enough: it only detects broken connections, not a process that is frozen but still connected. Mention the failure detector problem too: when a node stops sending heartbeats, you can't tell whether it is dead or just partitioned, so require quorum agreement before declaring it dead to avoid split-brain.
