Fault-Injection Loop: chaos as a labeling machine

case study · 2025 - present · 8 min read

chaos engineering · ml labels · reliability

Summary - A chaos-engineering program that injects controlled faults to produce ground-truth dependency-criticality labels, which continuously sharpen the risk classifier. It retired risky dependencies and protected nine-figure revenue exposure. Details simplified for public sharing.

Context

Dependency criticality labels - "if B goes down, does A survive?" - were heuristic guesses derived from traffic volume and config hints. Guesses are fine until an incident proves them wrong, which is the most expensive possible way to find out.

Approach: break it on purpose

The only honest answer comes from the experiment itself: inject a fault into the dependency under controlled conditions and observe whether the dependent service actually degrades. Each experiment converts one guess into a ground-truth label.

 classifier predicts ──▶ pick highest-uncertainty edges
        ▲                          │
        │                          ▼
 labels feed back ◀── inject fault, observe impact
                      (blast-radius capped, auto-abort)

An active-learning loop: every experiment adds a ground-truth label that improves the classifier.
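
The "pick highest-uncertainty edges" step can be sketched as plain uncertainty sampling. This is a minimal illustration, not the production selector: it assumes the classifier emits a probability that each edge is critical, ranks unlabeled edges by binary entropy, and takes the top few. The service names, the `pick_next_experiments` helper, and the budget are all hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (dependent service, dependency); names are hypothetical


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no prediction; peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def pick_next_experiments(
    predictions: Dict[Edge, float], labeled: Set[Edge], budget: int
) -> List[Edge]:
    """Return the `budget` unlabeled edges the classifier is least sure about."""
    candidates = [edge for edge in predictions if edge not in labeled]
    candidates.sort(key=lambda edge: binary_entropy(predictions[edge]), reverse=True)
    return candidates[:budget]


# Assumed classifier output: P(dependent fails if dependency goes down).
predictions = {
    ("checkout", "payments"): 0.97,
    ("checkout", "recommendations"): 0.52,
    ("search", "spellcheck"): 0.35,
    ("profile", "avatars"): 0.04,
}
next_edges = pick_next_experiments(predictions, labeled=set(), budget=2)
print(next_edges)
```

With these made-up numbers the two edges nearest 0.5 are selected, and the confidently predicted ones are left alone, which is exactly the tradeoff of uncertainty-driven selection.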

Safety design

Blast-radius caps - experiments scope to a slice of traffic, never whole services.
Automated aborts - guardrail metrics trip rollback in seconds, no human in the loop required.
Scheduling discipline - never during peak, never during incident response, never two experiments overlapping in one dependency chain.
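
A rough sketch of how the first two rules could fit together, under stated assumptions: the 5% traffic cap, the `Guardrail` type, and the `run_experiment` helper are invented for illustration, and metric readings arrive as a simple iterable of dictionaries rather than from a real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

MAX_TRAFFIC_FRACTION = 0.05  # assumed cap; the real number is not public


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_value: float  # abort as soon as a reading exceeds this


def run_experiment(
    traffic_fraction: float,
    guardrails: List[Guardrail],
    samples: Iterable[Dict[str, float]],
    stop_fault: Callable[[], None],
) -> str:
    """Watch guardrail metrics while a fault is active; abort without a human."""
    # Blast-radius cap: refuse to start rather than trim the scope silently.
    if not 0.0 < traffic_fraction <= MAX_TRAFFIC_FRACTION:
        raise ValueError("experiment scope exceeds the blast-radius cap")
    try:
        for sample in samples:
            for guardrail in guardrails:
                if sample.get(guardrail.metric, 0.0) > guardrail.max_value:
                    return f"aborted: {guardrail.metric}"
        return "completed"
    finally:
        # The fault is removed on every exit path: abort, completion, or error.
        stop_fault()


calls: List[str] = []
readings = [{"error_rate": 0.01}, {"error_rate": 0.09}, {"error_rate": 0.50}]
result = run_experiment(
    0.02, [Guardrail("error_rate", 0.05)], readings, lambda: calls.append("rollback")
)
print(result)
```

The design choice worth noting is the `finally` block: rollback does not depend on the loop reaching a clean end, so an unexpected exception still removes the fault.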

The counterintuitive lesson: a well-guarded fault-injection program is safer than not running one, because the alternative is running the same experiment unplanned, at peak, with customers watching.

The loop is the point

One-off chaos experiments produce trivia. The value came from closing the loop: labels train the classifier, the classifier's uncertainty picks the next experiments, and confident "this dependency is critical and unprotected" predictions trigger remediation - add a fallback, add a cache, or remove the dependency entirely.
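
One way to picture the routing at the end of the loop is a small triage function. The 0.9 confidence threshold, the `has_fallback` flag, and the `Action` names are assumptions made for this sketch; the source does not say how the real cutoffs were chosen.

```python
from enum import Enum


class Action(Enum):
    REMEDIATE = "remediate"    # add a fallback, add a cache, or remove the dependency
    EXPERIMENT = "experiment"  # still a guess: queue a fault injection
    SKIP = "skip"              # confident and nothing to fix


def triage(p_critical: float, has_fallback: bool, confident: float = 0.9) -> Action:
    """Route one dependency edge based on the classifier's prediction."""
    if p_critical >= confident and not has_fallback:
        return Action.REMEDIATE  # critical and unprotected
    if p_critical >= confident or p_critical <= 1 - confident:
        return Action.SKIP       # confidently protected, or confidently non-critical
    return Action.EXPERIMENT     # uncertain: worth spending an experiment on


print(triage(0.95, has_fallback=False))
print(triage(0.50, has_fallback=False))
```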

Decisions & tradeoffs

Uncertainty-driven selection over coverage sweeps. Letting the classifier pick the next experiments meant fewer, more informative runs - and accepting that well-understood edges never get "tested" at all.
Auto-abort authority over human approval. Guardrail metrics could roll back an experiment with no human in the loop. Contentious to grant - but a program that needs an engineer watching every run can never scale past a demo.
Traffic slices over full-service kills. Scoped experiments give a weaker signal per run and a bounded worst case. That bound is what turned "absolutely not" into "yes" in review.
A slower loop over a busier one. No peak windows, no overlapping experiments in one dependency chain. The lost throughput bought the safety record the program's credibility lived on.
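
The scheduling rules behind the slower loop reduce to a pre-flight check. The sketch below is illustrative only: the peak window, the `Experiment` type, and modeling a dependency chain as a set of service names are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, Iterable, List

PEAK_START, PEAK_END = time(9, 0), time(21, 0)  # assumed peak window


@dataclass(frozen=True)
class Experiment:
    name: str
    chain: FrozenSet[str]  # services in the dependency chain the experiment touches


def blockers(
    candidate: Experiment,
    now: datetime,
    running: Iterable[Experiment],
    incident_open: bool,
) -> List[str]:
    """Return every reason the candidate may not start; an empty list means go."""
    reasons = []
    if PEAK_START <= now.time() < PEAK_END:
        reasons.append("peak window")
    if incident_open:
        reasons.append("incident response in progress")
    for other in running:
        if candidate.chain & other.chain:  # shared service means same chain
            reasons.append(f"overlaps {other.name}")
    return reasons


running = [Experiment("exp-1", frozenset({"checkout", "payments"}))]
candidate = Experiment("exp-2", frozenset({"payments", "ledger"}))
quiet_hour = datetime(2025, 1, 15, 3, 0)
print(blockers(candidate, quiet_hour, running, incident_open=False))
```

Returning the full list of reasons, rather than a bare yes or no, keeps a refused run explainable to whoever asked for it.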

Impact

Eliminated a long tail of risky dependencies before they caused incidents.
Protected nine-figure revenue exposure attributable to the retired dependency risks.
Criticality labels went from folklore to evidence - and the explanation agents cite them.

Lessons

Sell the program as label generation, not chaos. "We break things" gets you a meeting with legal; "we convert guesses into ground truth that makes every downstream system smarter" gets you adoption. The engineering was the easy half.

What I'd do differently: close the classifier loop earlier - the first experiments were hand-picked and mostly confirmed what people already believed. And build experiment provenance (exact scope, guardrails, and timeline attached to every label) from the start, so a disputed label gets re-run instead of re-argued.
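
As a sketch of what that provenance could look like, here is one possible record shape. The field names, the `LabelProvenance` type, and the `rerun_spec` helper are hypothetical; the point is only that scope, guardrails, and timeline travel with the label.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class LabelProvenance:
    dependent: str
    dependency: str
    survived: bool                               # the ground-truth label itself
    traffic_fraction: float                      # exact scope of the experiment
    guardrails: Tuple[Tuple[str, float], ...]    # (metric, threshold) pairs in force
    started_at: datetime                         # timeline
    ended_at: datetime
    outcome: str                                 # e.g. "completed" or "aborted"

    def rerun_spec(self) -> dict:
        """Everything needed to repeat the experiment, minus the disputed result."""
        return {
            "dependent": self.dependent,
            "dependency": self.dependency,
            "traffic_fraction": self.traffic_fraction,
            "guardrails": [list(pair) for pair in self.guardrails],
        }


record = LabelProvenance(
    dependent="checkout",
    dependency="recommendations",
    survived=True,
    traffic_fraction=0.02,
    guardrails=(("error_rate", 0.05),),
    started_at=datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc),
    ended_at=datetime(2025, 1, 15, 3, 10, tzinfo=timezone.utc),
    outcome="completed",
)
print(json.dumps(record.rerun_spec()))
```

Because the re-run spec deliberately omits the label, a dispute is settled by repeating the same experiment and comparing results, not by debating the old one.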

Prabhav Nalhe