Reliability·3 min read

Chaos Engineering

Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.

Chaos engineering breaks things on purpose, in controlled experiments, to verify the system survives before a real outage proves it does not. Netflix's Chaos Monkey randomly kills production instances so engineers are forced to build for failure. You learn your weak spots on your terms.

Plain English: instead of hoping your failover and redundancy work, you intentionally break things (kill servers, add latency, cut network links), ideally in production, to find weaknesses before a real outage does. If killing a node causes an outage, better to learn it on a Tuesday afternoon than at 3am.

Used in: Netflix, Uber, Payment Gateway

What it is

The discipline of running controlled experiments on a system by deliberately injecting failures (killing instances, adding network latency, severing dependencies, exhausting resources) to verify that the system tolerates them as designed. Popularized by Netflix's Chaos Monkey, it treats resilience as a hypothesis to be tested, not assumed.

The problem it solves

Resilience features (failover, retries, circuit breakers, redundancy) are usually built and then never exercised until a real incident, when they often turn out to be broken. Untested failure paths give a false sense of safety. Chaos engineering surfaces these weaknesses on your schedule, under observation, instead of during a 3am outage when the blast radius and stress are maximal.

How it works

Form a hypothesis about steady-state behavior ('p99 latency stays under 200ms and error rate under 0.1%'). Define a small blast radius. Inject a real-world fault: terminate an instance, add latency to a dependency, drop a network link, or fill a disk. Observe whether steady state holds. If it breaks, you've found a weakness to fix; if it holds, your confidence in that failure path increases. Mature programs run experiments continuously and automatically, as Chaos Monkey does when it randomly kills production instances, with automated guardrails that halt the experiment when steady-state metrics breach their thresholds.
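The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real chaos tool: `ToyCluster`, `SteadyState`, and `run_experiment` are hypothetical names, and the simulated traffic stands in for metrics you would pull from your own monitoring.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SteadyState:
    """Hypothesis thresholds; the defaults mirror the example in the text."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%

    def holds(self, latencies_ms: List[float], errors: int) -> bool:
        if not latencies_ms:
            return False  # no traffic observed, so nothing is verified
        ordered = sorted(latencies_ms)
        p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]  # nearest-rank p99
        error_rate = errors / len(ordered)
        return p99 < self.max_p99_ms and error_rate < self.max_error_rate


class ToyCluster:
    """Hypothetical stand-in for a fleet of instances behind a load balancer."""

    def __init__(self, instances: int, seed: int = 0) -> None:
        self.total = instances
        self.healthy = set(range(instances))
        self.rng = random.Random(seed)

    def terminate(self, instance_id: int) -> None:
        self.healthy.discard(instance_id)

    def restore(self, instance_id: int) -> None:
        self.healthy.add(instance_id)

    def sample(self, requests: int) -> Tuple[List[float], int]:
        """Simulated traffic: latency grows as healthy capacity shrinks."""
        if not self.healthy:
            return [0.0] * requests, requests  # every request fails
        load = self.total / len(self.healthy)
        return [self.rng.uniform(20, 60) * load for _ in range(requests)], 0


def run_experiment(cluster: ToyCluster, hypothesis: SteadyState,
                   victim: int, requests: int = 1000) -> str:
    # 1. Confirm steady state BEFORE injecting anything.
    latencies, errors = cluster.sample(requests)
    if not hypothesis.holds(latencies, errors):
        return "skipped: steady state not met before injection"
    # 2. Inject the fault with a blast radius of one instance.
    cluster.terminate(victim)
    try:
        # 3. Observe whether steady state still holds.
        latencies, errors = cluster.sample(requests)
        held = hypothesis.holds(latencies, errors)
    finally:
        cluster.restore(victim)  # always roll back, even on error
    return "hypothesis held" if held else "weakness found"
```

In this toy model, a four-instance cluster absorbs the loss of one instance (latency rises by a third and stays under the bound), while a single-instance cluster fails every request under the same fault, which is exactly the kind of weakness the experiment is meant to expose.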

Why use it

Finds resilience gaps on your schedule instead of during a real outage
Continuously validates that failover, retries, and redundancy actually work as designed
Builds organizational confidence and forces good observability and automated recovery

What it costs you

Running it in production is genuinely risky: a poorly scoped experiment can cause the outage it was meant to prevent
Requires solid observability and automated rollback first, because chaos without monitoring is just sabotage
Cultural and organizational buy-in is hard; deliberately breaking production is a tough sell without trust

Where it shows up in our architectures

Netflix →
The origin of Chaos Monkey, which randomly terminates production instances so engineers are forced to build services that survive node loss
Uber →
Injecting latency and killing service instances validates that dispatch degrades gracefully and circuit breakers trip during a regional surge
Payment Gateway →
Simulating Stripe/card-network slowness verifies the circuit breakers and idempotent retries hold without double-charging
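The payment case can be made concrete with a small simulation. The sketch below assumes a hypothetical `FakeCardNetwork` that deduplicates by idempotency key and a deliberately simplified `CircuitBreaker` (no half-open state or reset timer); none of these names come from Stripe or any real card-network API. The injected fault records the charge and then loses the response, which is the case that produces double charges when retries are not idempotent.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class FakeCardNetwork:
    """Hypothetical downstream that deduplicates by idempotency key."""

    def __init__(self, timeouts_to_inject: int = 0) -> None:
        self.charges = {}  # idempotency key -> amount in cents
        self.timeouts_to_inject = timeouts_to_inject

    def charge(self, key: str, amount_cents: int) -> str:
        # Record the charge first: the injected fault loses the RESPONSE,
        # not the request, so a naive retry would charge twice.
        self.charges.setdefault(key, amount_cents)
        if self.timeouts_to_inject > 0:
            self.timeouts_to_inject -= 1
            raise TimeoutError("injected slowness: response lost")
        return "ok"


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive timeouts, never resets."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def call(self, fn, *args):
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("failing fast")
        try:
            result = fn(*args)
        except TimeoutError:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result


def pay(breaker: CircuitBreaker, network: FakeCardNetwork,
        key: str, amount_cents: int, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        try:
            return breaker.call(network.charge, key, amount_cents)
        except TimeoutError:
            continue  # retry with the SAME idempotency key
        except CircuitOpenError:
            return "rejected: circuit open"
    return "failed: retries exhausted"
```

Two checks fall out of this: with a couple of injected timeouts the payment eventually succeeds and `network.charges` holds exactly one entry for the key, and with sustained slowness the breaker opens and the caller is rejected quickly instead of piling more requests onto a struggling dependency.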

Gotchas

Don't run chaos in production until you have the observability and automated rollback to detect and stop a runaway experiment; otherwise you're just causing outages.
Start with the smallest blast radius (one instance, one dependency, off-peak) and expand as confidence grows. Game days in staging first, then carefully into prod.
Always have an abort button and automated guardrails that halt the experiment if steady-state SLOs are violated. A chaos experiment that you can't stop is an incident.
Chaos engineering tests resilience you've already built; it's not a substitute for designing for failure (redundancy, failover, idempotency). It validates, it doesn't create.
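The blast-radius and abort advice above might look like this in code. It is a minimal sketch with hypothetical names (`Guardrail`, `staged_experiment`); the `inject`, `rollback`, and `measure_error_rate` callables are placeholders for whatever your fault-injection tooling and monitoring actually provide, and a production guardrail would watch metrics continuously during a stage rather than once per stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Guardrail:
    max_error_rate: float          # steady-state SLO threshold
    abort_requested: bool = False  # the manual "abort button"

    def should_halt(self, error_rate: float) -> bool:
        return self.abort_requested or error_rate > self.max_error_rate


def staged_experiment(stages: Sequence[int],
                      inject: Callable[[int], None],
                      rollback: Callable[[], None],
                      measure_error_rate: Callable[[], float],
                      guardrail: Guardrail) -> List[str]:
    """Grow the blast radius stage by stage; stop at the first violation."""
    log: List[str] = []
    for radius in stages:
        if guardrail.abort_requested:
            log.append(f"aborted before blast radius {radius}")
            break
        inject(radius)
        try:
            error_rate = measure_error_rate()  # observed while the fault is live
        finally:
            rollback()  # the fault never outlives its stage
        if guardrail.should_halt(error_rate):
            log.append(f"halted at blast radius {radius}")
            break
        log.append(f"blast radius {radius} ok")
    return log
```

Called with `stages=[1, 2, 4]`, the runner only reaches four instances if one and two instances both stayed inside the SLO, and setting `abort_requested` stops it before the next injection.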

When this went wrong in production

Netflix's Christmas Eve outage that led to Chaos Kong · 2012

An ELB failure in AWS us-east-1 took Netflix down on Christmas Eve and pushed the company to accelerate multi-region failover and build Chaos Kong.

On Christmas Eve 2012, AWS suffered an ELB (Elastic Load Balancer) failure in us-east-1. Netflix's entire streaming infrastructure ran out of a single AWS region at the time. The ELB failure cascaded through Netflix's stack: API requests failed, playback failed, and millions of customers couldn't watch Netflix on Christmas Eve. Netflix's failover plan existed on paper but had never been exercised under real conditions. It failed. The incident directly caused Netflix to accelerate its multi-region active-active migration and to build Chaos Kong, the tool that simulates the loss of an entire AWS region in production on a regular schedule, so the failover path never goes stale. The lesson: a failover plan that's never been tested is a guess. You only trust what you've actually run.

Interview angle

Chaos engineering is a signal of operational maturity. Bring it up when an interviewer asks how you'd validate that your resilience mechanisms (circuit breakers, failover, retries) actually work. The key thing to say is that resilience features that are never tested are broken until proven otherwise. Mention Netflix's Chaos Monkey as the reference, and note that you need good observability and an abort mechanism before you run any experiment. Candidates who only describe the architecture, without explaining how they'd verify it in production, miss the operational depth the question is probing for.
