Reliability · 3 min read

Heartbeat

Periodic 'I'm alive' signals between nodes so failures are detected within seconds, not minutes.

Plain English: every few seconds, each server sends its peers a short message saying 'I'm still here.' If the messages stop, the others assume that server has died and stop sending it work. Without heartbeats, a failure can go unnoticed for minutes.

Used in: WhatsApp, Distributed Cache

What it is

A small periodic message sent from one node to another (or to a coordinator) signalling that the sender is alive and healthy. Absence of heartbeats for some threshold triggers failover, removal from a load-balancer pool, or alerts.

The problem it solves

Distributed systems need to know when a peer has failed. A TCP connection can appear 'open' long after the other side has crashed, because an idle connection exchanges no packets that would reveal the failure. Without explicit heartbeats, you would discover the failure only when the next request times out, potentially minutes later.

How it works

Each node sends a heartbeat every T seconds (typically 1-5 s). The receiver records the timestamp of the last heartbeat seen from each sender. If no heartbeat arrives within K × T (typically K = 3), the receiver declares the sender dead; with T = 2 s and K = 3, that means 6 s of silence. Typical follow-up actions: a leader election starts, the dead node is removed from rotation, and alerts fire.
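
The receiver side of this scheme can be sketched in a few lines of Python. This is a minimal illustration, not a production failure detector: the class name `HeartbeatMonitor`, its parameters, and the injectable `clock` are all assumptions made for the example, and it leaves out networking, clock skew, and the follow-up actions themselves.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical receiver-side tracker: stores last-seen times per node."""

    def __init__(
        self,
        interval_s: float = 2.0,   # T: how often senders are expected to beat
        missed_beats: int = 3,     # K: how many intervals of silence we tolerate
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = interval_s * missed_beats  # K × T
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record(self, node_id: str) -> None:
        """Call whenever a heartbeat from node_id arrives."""
        self._last_seen[node_id] = self._clock()

    def dead_nodes(self) -> List[str]:
        """Return nodes whose last heartbeat is older than K × T."""
        now = self._clock()
        return sorted(
            node_id
            for node_id, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A caller would invoke `record(node_id)` on every incoming heartbeat and poll `dead_nodes()` on a timer, then trigger whatever action fits (election, removal from rotation, alerting). The sketch uses a monotonic clock by default so that wall-clock adjustments on the receiver do not cause false positives.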

Why use it
What it costs you
Where it shows up in our architectures
Gotchas
