Fault-Injection Loop: chaos as a labeling machine
Context
Dependency criticality labels — "if B goes down, does A survive?" — were heuristic guesses derived from traffic volume and config hints. Guesses are fine until an incident proves them wrong, which is the most expensive possible way to find out.
Approach: break it on purpose
The only honest answer comes from the experiment itself: inject a fault into the dependency under controlled conditions and observe whether the dependent service actually degrades. Each experiment converts one guess into a ground-truth label.
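As a rough illustration of how one experiment could become one label, the sketch below compares the dependent service's error rate before and during the injected fault. The `Observation` type, the `label_edge` function, the 1-point tolerance, and the service names are all hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Error rates of the dependent service, measured on the experiment slice."""
    baseline_error_rate: float  # before the fault was injected
    fault_error_rate: float     # while the dependency was failing


def label_edge(obs: Observation, tolerance: float = 0.01) -> str:
    """Return 'critical' if the dependent degraded beyond the tolerance.

    The tolerance is an assumed threshold chosen for illustration only.
    """
    degradation = obs.fault_error_rate - obs.baseline_error_rate
    return "critical" if degradation > tolerance else "non_critical"


# Hypothetical edges, keyed as (dependent, dependency).
labels = {
    ("checkout", "recommendations"): label_edge(Observation(0.002, 0.004)),
    ("checkout", "payments"): label_edge(Observation(0.002, 0.350)),
}
```

A real program would likely look at more than one guardrail metric (latency, saturation, business KPIs), but the shape is the same: a measured difference replaces a heuristic guess.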
classifier predicts ──▶ pick highest-uncertainty edges
        ▲                          │
        │                          ▼
labels feed back    ◀── inject fault, observe impact
                        (blast-radius capped, auto-abort)
Safety design
- Blast-radius caps: each experiment is scoped to a slice of traffic, never a whole service.
- Automated aborts: guardrail metrics trip a rollback within seconds, with no human in the loop.
- Scheduling discipline: never during peak, never during incident response, and never two experiments at once in the same dependency chain.
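The three safeguards above could be expressed as a pre-flight check plus an abort predicate, roughly as follows. This is a minimal sketch under assumed values: the 5% traffic cap, the 09:00 to 21:00 peak window, the 2% guardrail limit, and the names `Experiment`, `may_start`, and `should_abort` are illustrative, not the production configuration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Experiment:
    edge: tuple[str, str]    # (dependent, dependency)
    chain: frozenset[str]    # every service in the affected dependency chain
    traffic_fraction: float  # share of traffic exposed to the fault


MAX_TRAFFIC_FRACTION = 0.05               # assumed blast-radius cap
PEAK_START, PEAK_END = time(9), time(21)  # assumed peak window


def may_start(exp: Experiment, now: time, incident_open: bool,
              running: list[Experiment]) -> tuple[bool, str]:
    """Return (allowed, reason) for starting an experiment right now."""
    if exp.traffic_fraction > MAX_TRAFFIC_FRACTION:
        return False, "blast radius exceeds cap"
    if incident_open:
        return False, "incident response in progress"
    if PEAK_START <= now < PEAK_END:
        return False, "peak hours"
    # Two experiments overlap if their dependency chains share any service.
    if any(exp.chain & other.chain for other in running):
        return False, "overlaps a running experiment's dependency chain"
    return True, "ok"


def should_abort(guardrail_error_rate: float, limit: float = 0.02) -> bool:
    """Trip the rollback as soon as the guardrail metric exceeds its limit."""
    return guardrail_error_rate > limit
```

Keeping the checks as plain predicates makes each refusal explainable: the scheduler can log the reason string alongside the experiment it declined to run.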
The counterintuitive lesson: a well-guarded fault-injection program is safer than not running one, because the alternative is running the same experiment unplanned, at peak, with customers watching.
The loop is the point
One-off chaos experiments produce trivia. The value came from closing the loop. Labels train the classifier, the classifier's uncertainty picks the next experiments, and confident "this dependency is critical and unprotected" predictions trigger remediation: add a fallback, add a cache, or remove the dependency entirely.
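One common way to let uncertainty pick the next experiments is entropy-based uncertainty sampling, sketched below. It assumes the classifier outputs a probability that each still-unlabeled edge is critical. The function names, the 0.9 remediation threshold, and the example edges are hypothetical, and the case study does not say which uncertainty measure was actually used.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no prediction; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def next_experiments(predictions: dict[tuple[str, str], float],
                     budget: int) -> list[tuple[str, str]]:
    """Pick the `budget` edges whose predicted criticality is least certain."""
    ranked = sorted(predictions,
                    key=lambda edge: binary_entropy(predictions[edge]),
                    reverse=True)
    return ranked[:budget]


def needs_remediation(p_critical: float, has_fallback: bool,
                      threshold: float = 0.9) -> bool:
    """Flag confident 'critical and unprotected' predictions (assumed threshold)."""
    return p_critical >= threshold and not has_fallback


# Hypothetical predicted probabilities that each (dependent, dependency) edge is critical.
predictions = {
    ("checkout", "payments"): 0.97,
    ("checkout", "recommendations"): 0.52,
    ("search", "spellcheck"): 0.30,
}
```

With a budget of two, the confident 0.97 edge is skipped for experimentation and routed to remediation instead, while the two ambiguous edges become the next faults to inject.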
Impact
- Eliminated a long tail of risky dependencies before they caused incidents.
- Removed nine figures of revenue exposure attributable to the retired dependency risks.
- Criticality labels went from folklore to evidence — and the explanation agents cite them.
Lessons
Sell the program as label generation, not chaos. "We break things" gets you a meeting with legal; "we convert guesses into ground truth that makes every downstream system smarter" gets you adoption. The engineering was the easy half.