Fault-Injection Loop: chaos as a labeling machine

Summary: A chaos-engineering program that injects controlled faults to produce ground-truth dependency-criticality labels, which continuously sharpen the risk classifier. It retired risky dependencies and protected nine-figure revenue exposure. Details simplified for public sharing.

Context

Dependency criticality labels — "if B goes down, does A survive?" — were heuristic guesses derived from traffic volume and config hints. Guesses are fine until an incident proves them wrong, which is the most expensive possible way to find out.

Approach: break it on purpose

The only honest answer comes from the experiment itself: inject a fault into the dependency under controlled conditions and observe whether the dependent service actually degrades. Each experiment converts one guess into a ground-truth label.
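As a rough illustration of "one experiment, one label", the sketch below turns an observed outcome into a criticality label. The record fields, the error-rate metric, and the `tolerance` threshold are all assumptions made for this example; the real program's schema and guardrail metrics are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """Outcome of one fault injection on the edge dependent -> dependency.

    Hypothetical schema, for illustration only.
    """
    dependent: str
    dependency: str
    baseline_error_rate: float  # dependent's error rate before the fault
    faulted_error_rate: float   # dependent's error rate while the fault was active


def to_label(result: ExperimentResult, tolerance: float = 0.01) -> str:
    """Convert an observed outcome into a ground-truth criticality label.

    `tolerance` is an assumed noise threshold: an error-rate increase
    at or below it is not counted as degradation.
    """
    increase = result.faulted_error_rate - result.baseline_error_rate
    return "critical" if increase > tolerance else "non_critical"
```

A dependent that barely moves when its dependency is faulted gets `non_critical`; one whose error rate jumps gets `critical`. In practice a single metric is unlikely to be enough, but the shape of the conversion stays the same.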

 classifier predicts ──▶ pick highest-uncertainty edges
        ▲                          │
        │                          ▼
 labels feed back ◀── inject fault, observe impact
                      (blast-radius capped, auto-abort)
an active-learning loop: every experiment makes the classifier better
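The "pick highest-uncertainty edges" step in the diagram is what the active-learning literature calls uncertainty sampling. A minimal sketch, assuming the classifier emits a probability that each edge is critical and that experiments are limited by a per-cycle `budget` (both assumptions for this example):

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no prediction; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def pick_next_edges(
    predictions: dict[tuple[str, str], float], budget: int
) -> list[tuple[str, str]]:
    """Return the `budget` edges whose predicted criticality is least certain.

    `predictions` maps (dependent, dependency) to P(edge is critical).
    """
    ranked = sorted(
        predictions,
        key=lambda edge: binary_entropy(predictions[edge]),
        reverse=True,
    )
    return ranked[:budget]
```

Edges the classifier already rates near 0 or 1 are skipped, so the experiment budget goes to the guesses most likely to be wrong.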

Safety design

  • Blast-radius caps — experiments scope to a slice of traffic, never whole services.
  • Automated aborts — guardrail metrics trip rollback in seconds, no human in the loop required.
  • Scheduling discipline — never during peak, never during incident response, never two experiments overlapping in one dependency chain.
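The three rules above can be sketched as a pre-flight check plus an abort predicate. Everything named here is hypothetical: the `Experiment` fields, the 5% traffic cap, and the shape of the guardrail metrics are assumptions for illustration, not the program's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    edge: tuple[str, str]    # (dependent, dependency)
    chain: frozenset[str]    # services in the dependency chain the fault touches
    traffic_fraction: float  # share of traffic exposed to the fault


MAX_TRAFFIC_FRACTION = 0.05  # assumed cap; no real figure is given


def may_start(candidate: Experiment, running: list[Experiment],
              is_peak: bool, incident_open: bool) -> tuple[bool, str]:
    """Apply the blast-radius cap and the scheduling rules before a run."""
    if candidate.traffic_fraction > MAX_TRAFFIC_FRACTION:
        return False, "blast radius above cap"
    if is_peak:
        return False, "peak traffic window"
    if incident_open:
        return False, "incident response in progress"
    for other in running:
        # Any shared service means the two dependency chains overlap.
        if candidate.chain & other.chain:
            return False, "overlaps a running experiment"
    return True, "ok"


def should_abort(metrics: dict[str, float], limits: dict[str, float]) -> bool:
    """True if any guardrail metric exceeds its limit.

    A metric missing from `metrics` counts as a breach (fail closed).
    """
    return any(metrics.get(name, float("inf")) > limit
               for name, limit in limits.items())
```

Treating a missing metric as a breach is a deliberate choice in this sketch: if the guardrail cannot see, the experiment should not keep running.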

The counterintuitive lesson: a well-guarded fault-injection program is safer than not running one, because the alternative is running the same experiment unplanned, at peak, with customers watching.

The loop is the point

One-off chaos experiments produce trivia. The value came from closing the loop: labels train the classifier, the classifier's uncertainty picks the next experiments, and confident "this dependency is critical and unprotected" predictions trigger remediation — add a fallback, cache, or remove the dependency entirely.
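One way to picture the routing described above is a small triage function. The 0.9 confidence threshold and the `has_fallback` flag are assumptions for this sketch; how "confident" and "unprotected" were actually defined is not stated here.

```python
def next_action(p_critical: float, has_fallback: bool,
                confident: float = 0.9) -> str:
    """Route one dependency edge based on the classifier's prediction.

    `confident` is an assumed threshold, applied symmetrically to
    "critical" and "not critical" predictions.
    """
    if p_critical >= confident:
        # Confidently critical: remediate unless already protected.
        return "no_action" if has_fallback else "remediate"
    if p_critical <= 1 - confident:
        # Confidently non-critical: nothing to do.
        return "no_action"
    # Uncertain: spend an experiment to get a ground-truth label.
    return "experiment"
```

Here `remediate` stands for any of the options named above: add a fallback, add a cache, or remove the dependency entirely.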

Impact

  • Eliminated a long tail of risky dependencies before they caused incidents.
  • Protected nine-figure revenue exposure attributable to the retired dependency risks.
  • Criticality labels went from folklore to evidence — and the explanation agents cite them.

Lessons

Sell the program as label generation, not chaos. "We break things" gets you a meeting with legal; "we convert guesses into ground truth that makes every downstream system smarter" gets you adoption. The engineering was the easy half.