Reliability · 3 min read

Chaos Engineering

Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.


Plain English: instead of hoping your failover and redundancy work, you intentionally break things (kill servers, add latency, cut network links), ideally in production, to find weaknesses before a real outage does. If killing a node causes an outage, better to learn it on a Tuesday afternoon than at 3am.

Used in: Netflix, Uber, Payment Gateway

What it is

Chaos engineering is the discipline of running controlled experiments on a system: deliberately injecting failures (killing instances, adding network latency, severing dependencies, exhausting resources) to verify that the system tolerates them as designed. Popularized by Netflix's Chaos Monkey, it treats resilience as a hypothesis to be tested, not assumed.

The problem it solves

Resilience features (failover, retries, circuit breakers, redundancy) are usually built and then never exercised until a real incident, when they often turn out to be broken. Untested failure paths create a false sense of safety. Chaos engineering surfaces these weaknesses on your schedule, under observation, instead of during a 3am outage when both the blast radius and the stress are at their worst.

How it works

Form a hypothesis about steady-state behavior ('p99 latency stays under 200ms and error rate under 0.1%'). Define a small blast radius. Inject a real-world fault: terminate an instance, add latency to a dependency, drop a network link, fill a disk. Observe whether steady state holds. If it breaks, you've found a weakness to fix; if it holds, confidence increases. Mature programs run continuously and automatically (Chaos Monkey randomly kills production instances) with automated guardrails to halt the experiment if it goes too far.
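
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `SteadyState`, `Sample`, `run_experiment`, and the `inject`, `rollback`, and `observe` callables are all hypothetical names, and the 5% abort threshold is an assumed guardrail value. Only the 200ms and 0.1% thresholds come from the example hypothesis in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SteadyState:
    """Hypothesis thresholds (defaults mirror the example in the text)."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%


@dataclass
class Sample:
    """One observation window of production metrics (hypothetical shape)."""
    latencies_ms: List[float]
    errors: int
    requests: int


def p99(latencies: List[float]) -> float:
    """Nearest-rank 99th percentile, using integer math to avoid float rounding."""
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[max(rank, 1) - 1]


def error_rate(sample: Sample) -> float:
    return sample.errors / sample.requests if sample.requests else 0.0


def holds(state: SteadyState, sample: Sample) -> bool:
    """True if the sample satisfies the steady-state hypothesis."""
    return (p99(sample.latencies_ms) < state.max_p99_ms
            and error_rate(sample) < state.max_error_rate)


def run_experiment(
    state: SteadyState,
    inject: Callable[[], None],    # e.g. terminate one instance (assumed hook)
    rollback: Callable[[], None],  # undo the fault (assumed hook)
    observe: Callable[[], Sample],
    abort_error_rate: float = 0.05,  # guardrail threshold: an assumption
    ticks: int = 5,
) -> str:
    # Never inject into a system that is already unhealthy.
    if not holds(state, observe()):
        return "skipped"
    inject()
    try:
        for _ in range(ticks):
            sample = observe()
            if error_rate(sample) >= abort_error_rate:
                return "aborted"         # guardrail: damage is too large
            if not holds(state, sample):
                return "weakness found"  # hypothesis falsified
        return "hypothesis held"
    finally:
        rollback()  # always remove the fault, whatever the outcome
```

In a real setup, `inject` and `rollback` would call whatever your infrastructure exposes for terminating instances or shaping traffic, and `observe` would query your metrics system. Keeping the baseline check and the rollback inside the runner means a single experiment cannot start on a degraded system or leave a fault behind.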

