Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.
Deliberately break a service and confirm the system survives.
Chaos engineering breaks things on purpose, in controlled experiments, to verify the system survives before a real outage proves it does not. Netflix's Chaos Monkey randomly kills production instances so engineers are forced to build for failure. You learn your weak spots on your terms.
Plain English: instead of hoping your failover and redundancy work, you intentionally break things (kill servers, add latency, cut network links), ideally in production, to find weaknesses before a real outage does. If killing a node causes an outage, better to learn it on a Tuesday afternoon than at 3am.
The discipline of running controlled experiments on a system by deliberately injecting failures (killing instances, adding network latency, severing dependencies, exhausting resources) to verify that the system tolerates them as designed. Popularized by Netflix's Chaos Monkey, it treats resilience as a hypothesis to be tested, not assumed.
Resilience features (failover, retries, circuit breakers, redundancy) are usually built and then never exercised until a real incident, when they often turn out to be broken. Untested failure paths create a false sense of safety. Chaos engineering surfaces these weaknesses on your schedule and under observation, instead of during a 3am outage when both the blast radius and the stress are at their worst.
Form a hypothesis about steady-state behavior ('p99 latency stays under 200ms and error rate under 0.1%'). Define a small blast radius. Inject a real-world fault: terminate an instance, add latency to a dependency, drop a network link, fill a disk. Observe whether steady state holds. If it breaks, you've found a weakness to fix; if it holds, confidence increases. Mature programs run continuously and automatically (Chaos Monkey randomly kills production instances) with automated guardrails to halt the experiment if it goes too far.
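The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: `ToyService`, `run_experiment`, the fixed 50 ms and 80 ms latencies, and the 5% abort threshold are all hypothetical names and values chosen for the example. The steady-state thresholds are the ones from the hypothesis quoted above, and the blast radius is one replica out of three.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    latency_ms: float
    ok: bool


@dataclass
class SteadyState:
    # Thresholds from the example hypothesis in the text.
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%


def error_rate(samples: List[Sample]) -> float:
    return sum(not s.ok for s in samples) / len(samples)


def p99(samples: List[Sample]) -> float:
    # Nearest-rank 99th percentile of observed latency.
    ordered = sorted(s.latency_ms for s in samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def holds(samples: List[Sample], hypothesis: SteadyState) -> bool:
    return (p99(samples) < hypothesis.max_p99_ms
            and error_rate(samples) < hypothesis.max_error_rate)


def run_experiment(probe: Callable[[], Sample],
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   hypothesis: SteadyState,
                   abort_error_rate: float = 0.05,  # assumed guardrail
                   rounds: int = 5,
                   batch: int = 200) -> str:
    # Only experiment on a system that is healthy to begin with.
    if not holds([probe() for _ in range(batch)], hypothesis):
        return "skipped"
    inject()
    try:
        for _ in range(rounds):
            window = [probe() for _ in range(batch)]
            if error_rate(window) >= abort_error_rate:
                return "aborted"   # guardrail: halt before more damage
            if not holds(window, hypothesis):
                return "violated"  # weakness found, within tolerance
        return "held"              # confidence increases
    finally:
        rollback()                 # always undo the injected fault


class ToyService:
    """Hypothetical round-robin service with optional failover."""

    def __init__(self, replicas: int = 3, failover: bool = True):
        self.healthy = [True] * replicas
        self.failover = failover
        self._next = 0

    def kill(self, index: int) -> None:
        self.healthy[index] = False

    def restore(self, index: int) -> None:
        self.healthy[index] = True

    def probe(self) -> Sample:
        i = self._next % len(self.healthy)
        self._next += 1
        if self.healthy[i]:
            return Sample(50.0, True)
        if self.failover and any(self.healthy):
            return Sample(80.0, True)  # one retry on a healthy replica
        return Sample(50.0, False)


def demo(failover: bool) -> str:
    svc = ToyService(replicas=3, failover=failover)
    return run_experiment(svc.probe, lambda: svc.kill(0),
                          lambda: svc.restore(0), SteadyState())
```

With these toy numbers, `demo(True)` returns `"held"` because the retry absorbs the dead replica, while `demo(False)` returns `"aborted"` because roughly a third of requests fail and the guardrail stops the run. In both cases the `finally` block restores the replica, which is the property a real abort mechanism needs to guarantee.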
The origin of Chaos Monkey, which randomly terminates production instances so engineers are forced to build services that survive node loss
Injecting latency and killing service instances validates that dispatch degrades gracefully and circuit breakers trip during a regional surge
Simulating Stripe/card-network slowness verifies the circuit breakers and idempotent retries hold without double-charging
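To make the payment example concrete, here is a hedged sketch of what such an experiment checks. `FakeGateway`, `CircuitBreaker`, and `pay` are hypothetical names invented for illustration and do not correspond to any real provider's API. The injected fault is a gateway that records the charge but answers too late, so the client sees a timeout. The breaker is simplified: it has no half-open state or recovery timer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class GatewayTimeout(Exception):
    pass


class CircuitOpen(Exception):
    pass


@dataclass
class FakeGateway:
    """Hypothetical stand-in for a slow card network."""
    slow_calls: int = 0  # fault injection: the next N calls time out
    charges: Dict[str, int] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # The charge is recorded once per key; a repeated key is a no-op.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self.slow_calls > 0:
            self.slow_calls -= 1
            raise GatewayTimeout("response arrived after the deadline")
        return "ok"


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn: Callable[..., str], *args) -> str:
        if self.is_open:
            raise CircuitOpen("failing fast")
        try:
            result = fn(*args)
        except GatewayTimeout:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result


def pay(gateway: FakeGateway, breaker: CircuitBreaker,
        key: str, amount_cents: int, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        try:
            # Every retry reuses the same idempotency key.
            return breaker.call(gateway.charge, key, amount_cents)
        except GatewayTimeout:
            continue
        except CircuitOpen:
            return "rejected"  # shed load instead of piling on
    return "failed"


# Experiment 1: brief slowness. Retries succeed, customer charged once.
gw = FakeGateway(slow_calls=2)
status = pay(gw, CircuitBreaker(), "order-1", 500)

# Experiment 2: sustained slowness. The breaker should trip.
gw2 = FakeGateway(slow_calls=10)
br2 = CircuitBreaker()
first = pay(gw2, br2, "order-2", 700)
second = pay(gw2, br2, "order-3", 900)
```

In the first experiment `status` is `"ok"` and `gw.charges` holds a single 500-cent entry despite three attempts. In the second, `first` is `"failed"`, the breaker opens, and `second` is `"rejected"` without the gateway ever being called. Note that `order-2` was still recorded on the gateway side even though the client gave up, which is exactly the kind of reconciliation gap an experiment like this is meant to expose.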
An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve 2012 and pushed the company toward multi-region failover and Chaos Kong.
On Christmas Eve 2012, AWS suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident pushed Netflix to accelerate its multi-region active-active migration and to build Chaos Kong, a tool that simulates the loss of an entire AWS region in production on a regular schedule, so the failover path never goes stale. The lesson: a failover plan that has never been tested is a guess. You can only trust what you have actually run.
Chaos engineering is a signal of operational maturity. Bring it up when an interviewer asks how you'd validate that your resilience mechanisms (circuit breakers, failover, retries) actually work. The key point is that resilience features that are never tested should be assumed broken until proven otherwise. Mention Netflix's Chaos Monkey as the reference, and note that you need good observability and an abort mechanism before you run any experiment. Candidates who only describe the architecture, without explaining how they would verify it in production, miss the operational depth the interviewer is probing for.