Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.
Plain English: instead of hoping your failover and redundancy work, you intentionally break things (kill servers, add latency, cut network links), ideally in production, to find weaknesses before a real outage does. If killing a node causes an outage, better to learn it on a Tuesday afternoon than at 3am.
The discipline of running controlled experiments on a system by deliberately injecting failures (killing instances, adding network latency, severing dependencies, exhausting resources) to verify that the system tolerates them as designed. Popularized by Netflix's Chaos Monkey, it treats resilience as a hypothesis to be tested, not assumed.
Resilience features (failover, retries, circuit breakers, redundancy) are usually built and then never exercised until a real incident, when they often turn out to be broken. Untested failure paths give a false sense of safety. Chaos engineering surfaces these weaknesses on your schedule, under observation, instead of during a 3am outage when both the blast radius and the stress are at their worst.
Form a hypothesis about steady-state behavior ('p99 latency stays under 200ms and error rate under 0.1%'). Define a small blast radius, meaning a limit on how much of the system the experiment is allowed to affect. Inject a real-world fault: terminate an instance, add latency to a dependency, drop a network link, fill a disk. Observe whether steady state holds. If it breaks, you've found a weakness to fix; if it holds, confidence increases. Mature programs run continuously and automatically (Chaos Monkey randomly kills production instances), with automated guardrails that halt the experiment and roll back the fault as soon as the steady-state metrics breach their thresholds.
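The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: ToyCluster, its capacity model, and the latency formula are hypothetical stand-ins for a real service and its metrics, and only the thresholds (p99 under 200ms, error rate under 0.1%) come from the text.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Hypothesis thresholds, taken from the example in the text."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%

    def holds(self, latencies_ms, errors):
        ranked = sorted(latencies_ms)
        p99 = ranked[min(len(ranked) - 1, int(0.99 * len(ranked)))]
        error_rate = errors / len(latencies_ms)
        return p99 <= self.max_p99_ms and error_rate <= self.max_error_rate


class ToyCluster:
    """Hypothetical model: identical replicas sharing load evenly."""

    def __init__(self, replicas=4, capacity_per_replica=100):
        self.replicas = replicas
        self.capacity_per_replica = capacity_per_replica

    def kill_instance(self):
        self.replicas -= 1

    def restore_instance(self):
        self.replicas += 1

    def sample(self, requests=250):
        """Return (latencies_ms, error_count) for one observation window."""
        total_capacity = self.replicas * self.capacity_per_replica
        if total_capacity <= 0:
            return [float("inf")] * requests, requests
        load = requests / total_capacity
        latency = 40.0 + 100.0 * load  # assumed: latency grows with load
        errors = max(0, requests - total_capacity)  # overflow is dropped
        return [latency] * requests, errors


def run_experiment(cluster, hypothesis, blast_radius=1):
    """Kill up to `blast_radius` instances, halting as soon as steady state breaks."""
    # Never inject faults into a system that is already unhealthy.
    if not hypothesis.holds(*cluster.sample()):
        return "aborted: system unhealthy before injection"
    killed = 0
    try:
        for _ in range(blast_radius):
            cluster.kill_instance()
            killed += 1
            if not hypothesis.holds(*cluster.sample()):
                # Guardrail: stop injecting the moment the hypothesis fails.
                return f"weakness found after killing {killed} instance(s)"
        return f"steady state held with {killed} instance(s) down"
    finally:
        # Roll back the fault whether the experiment passed or failed.
        for _ in range(killed):
            cluster.restore_instance()


if __name__ == "__main__":
    print(run_experiment(ToyCluster(), SteadyState(), blast_radius=1))
    print(run_experiment(ToyCluster(), SteadyState(), blast_radius=2))
```

With the assumed numbers, losing one of four replicas leaves enough capacity for the hypothesis to hold, while losing two drops requests and trips the guardrail. In a real program the sample step would read from production monitoring and the kill step would call the platform's own instance-termination mechanism.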
The origin of Chaos Monkey, which randomly terminates production instances so engineers are forced to build services that survive node loss
Injecting latency and killing service instances validate that dispatch degrades gracefully and that circuit breakers trip during a regional surge
Simulating Stripe/card-network slowness verifies that circuit breakers and idempotent retries hold without double-charging
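The payment example can be made concrete with a small simulation. The sketch below is an assumption-laden illustration: FakeCardNetwork is a hypothetical stub, not Stripe's or any card network's API, and the circuit breaker is deliberately simplified (a consecutive-failure counter with no half-open recovery state). It shows what such an experiment would assert: under injected slowness, retries reuse one idempotency key, the breaker opens, and the customer is charged at most once.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the provider."""


class FakeCardNetwork:
    """Hypothetical provider stub; setting `slow` is the injected fault."""

    def __init__(self):
        self.slow = False
        self.calls = 0
        self.charges = {}  # idempotency key -> amount in cents

    def charge(self, idempotency_key, amount_cents):
        self.calls += 1
        # The provider records at most one charge per idempotency key.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self.slow:
            # The charge landed, but the response did not arrive in time.
            raise TimeoutError("provider response timed out")
        return "ok"


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive timeouts."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def call(self, fn, *args):
        if self.is_open:
            raise CircuitOpenError("failing fast")
        try:
            result = fn(*args)
        except TimeoutError:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result


def pay(breaker, provider, order_id, amount_cents, max_attempts=5):
    key = f"order-{order_id}"  # the same key is reused on every retry
    for _ in range(max_attempts):
        try:
            return breaker.call(provider.charge, key, amount_cents)
        except TimeoutError:
            continue  # safe to retry because the key makes it idempotent
        except CircuitOpenError:
            break  # stop hammering a struggling dependency
    return "deferred"


if __name__ == "__main__":
    provider, breaker = FakeCardNetwork(), CircuitBreaker()
    provider.slow = True  # inject the fault
    print(pay(breaker, provider, 42, 1999))
    print(provider.calls, len(provider.charges), breaker.is_open)
```

In this run the provider is called three times before the breaker opens, yet only one charge exists for the order. A retry path that generated a fresh key per attempt would fail the same check, which is the kind of weakness the experiment is meant to expose.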