Closing the Loop: Fault Injection as a Ground-Truth Engine for an LLM Classifier

2026·6 min read

fault injection · ml labels · chaos eng

A pattern worth reusing: when an LLM makes a judgment at scale, do not let it grade itself. Pair a cheap predictor with an expensive verifier that observes ground truth by breaking things on purpose - and treat their disagreements as the most valuable signal you have.

flowchart LR
  C["LLM classifier<br/>cheap, predicts fleet-wide"] -->|"predictions to check"| V["fault injection<br/>expensive, verifies a sample"]
  V -->|"verified labels (disagreements)"| C
  classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
  class C key
  linkStyle 1 stroke:#2383E2,stroke-width:2px,color:#2383E2

A cheap predictor labels the whole fleet; an expensive verifier checks a sample by breaking things on purpose. Their disagreements become the verified labels that correct the predictor.

The problem: a classifier that is confident but unverifiable

Picture an LLM classifying, across a large fleet of services, whether each dependency is hard or soft. A hard dependency takes its caller down when it fails; a soft one degrades gracefully - a cache miss, a fallback path, a feature that quietly turns off. You fortify the hard ones and leave the soft ones alone, so the cost of a wrong label is real: either you burn engineering on phantom risk, or you leave a true single-point-of-failure exposed.

The classifier reads call sites, the introducing change, service metadata, and the surrounding code, and emits a label plus a failure-impact summary. Producing a confident label is easy. Knowing it is correct is the hard part. A model can read a code path and reason its way to "hard" with total fluency and still be wrong, because real behavior depends on retry policy, timeouts, circuit breakers, and fallbacks that are not always legible from the source. Self-reported confidence is not ground truth. So the real design question is: where does verified truth come from, at a scale that keeps the classifier honest over time?

The mechanism: disable one dependency, watch what breaks, record the label

The way out is to stop inferring criticality and start measuring it - by breaking things on purpose in a controlled, non-production environment. With a chaos and fault-injection framework, you can build an experiment harness that turns failures into labels. It runs one clean experiment at a time: take a single dependency edge, inject a fault that disables exactly that edge, hold everything else constant, and watch which SLOs and SLIs move.

If disabling the edge breaches the caller's error budget or drops a key SLI, the edge is hard, and the experiment records it as an observed outcome - not a model opinion. If the caller absorbs the loss within budget, it is soft. One-dependency-at-a-time isolation is the whole trick: when only one thing changed, and a baseline run without the fault stays within budget, the degradation you measure is attributable to that thing. That is what makes the result a label instead of an anecdote.
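The decision rule above can be sketched in a few lines. This is a minimal illustration, not a real harness: `SliWindow`, `label_edge`, the error-rate SLI, and the extra "inconclusive" outcome for a noisy baseline are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliWindow:
    """Request counts seen by the caller during one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int

    @property
    def error_rate(self) -> float:
        # An empty window has no evidence of failure, so report 0.0.
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def label_edge(baseline: SliWindow, faulted: SliWindow, error_budget: float) -> str:
    """Label one isolated experiment as 'hard', 'soft', or 'inconclusive'.

    error_budget is the maximum tolerated error rate for the caller
    (an assumed SLO expressed as a fraction).
    """
    # If the control run already breaches the budget, the injected fault
    # cannot be blamed for the breach, so refuse to emit a label.
    if baseline.error_rate > error_budget:
        return "inconclusive"
    return "hard" if faulted.error_rate > error_budget else "soft"


# Made-up numbers: a quiet baseline, then the same caller with one edge disabled.
baseline = SliWindow(total_requests=10_000, failed_requests=5)
faulted = SliWindow(total_requests=10_000, failed_requests=4_200)
print(label_edge(baseline, faulted, error_budget=0.001))  # hard
```

Returning "inconclusive" rather than guessing keeps noisy runs out of the label set, which matters more than squeezing a label out of every experiment.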

The loop: experiments become labels, labels keep the classifier honest

Every experiment emits a row: this edge, under this fault, produced this SLO or SLI degradation, therefore this observed hardness. That is exactly the verified ground truth the classifier was missing, and it becomes the label set you cross-validate and tune against - so a prediction for an edge can be checked against what actually happened when that edge was severed in the lab.
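One way to picture that row, and the check it enables, is the sketch below. The field names, fault names, service names, and numbers are illustrative assumptions, not a real schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Edge = Tuple[str, str]  # (caller, dependency)


@dataclass(frozen=True)
class ExperimentRow:
    """One verified observation from the harness (hypothetical field names)."""
    caller: str
    dependency: str
    fault: str        # what was injected, e.g. "blackhole"
    sli_name: str     # which indicator was watched
    sli_delta: float  # change versus the baseline run
    observed: str     # "hard" or "soft"

    @property
    def edge(self) -> Edge:
        return (self.caller, self.dependency)


def accuracy(rows: Iterable[ExperimentRow], predictions: Dict[Edge, str]) -> Optional[float]:
    """Share of verified edges where the prediction matches the observed label.

    Returns None when no verified edge has a prediction, rather than a
    misleading 0.0 or 1.0.
    """
    checked = [row for row in rows if row.edge in predictions]
    if not checked:
        return None
    hits = sum(1 for row in checked if predictions[row.edge] == row.observed)
    return hits / len(checked)


# Made-up rows and predictions, for illustration only.
rows = [
    ExperimentRow("checkout", "payments", "blackhole", "success_rate", -0.41, "hard"),
    ExperimentRow("checkout", "recs-cache", "blackhole", "success_rate", -0.001, "soft"),
]
predictions = {("checkout", "payments"): "hard", ("checkout", "recs-cache"): "hard"}
print(accuracy(rows, predictions))  # 0.5
```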

The two halves are complementary. The classifier covers the whole fleet cheaply because inference is cheap; the harness covers a smaller slice expensively but with certainty. So you let the cheap thing predict everywhere and the expensive thing verify and correct. When the harness disagrees with the model - the model said soft, the experiment breached the budget - that disagreement is the highest-value signal in the system. It points straight at where reasoning-from-code misses production reality: an undocumented retry storm, a missing fallback, a shared resource the code did not make obvious. Those disagreements feed back as corrected reference labels and adjustments to the prompting and decision thresholds - no model weights are trained; the classifier is improved through eval and tuning. And a held-out slice of verified labels is reserved purely to score the model, so the grading stays honest too. The accuracy you report then means something, because it is measured against observed outcomes rather than the model's own confidence.
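A sketch of the two mechanics in that paragraph: reserving a held-out slice and pulling out the disagreements that feed tuning. The hash-based split and the 20 percent default are assumptions chosen for the example, not something the loop requires.

```python
import hashlib
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (caller, dependency)


def is_held_out(edge: Edge, held_out_percent: int = 20) -> bool:
    """Deterministically assign an edge to the held-out scoring slice.

    Hashing the edge name, instead of shuffling, keeps an edge in the same
    slice on every run, so it never drifts from scoring into tuning.
    """
    digest = hashlib.sha256(f"{edge[0]}->{edge[1]}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < held_out_percent


def disagreements(
    observed: Dict[Edge, str],
    predicted: Dict[Edge, str],
    held_out_percent: int = 20,
) -> List[Tuple[Edge, str, str]]:
    """Return (edge, predicted, observed) for tuning-slice edges where they differ."""
    found = []
    for edge, truth in sorted(observed.items()):
        if is_held_out(edge, held_out_percent):
            continue  # held-out edges are for scoring only, never for tuning
        guess = predicted.get(edge)
        if guess is not None and guess != truth:
            found.append((edge, guess, truth))
    return found
```

Only the tuning slice ever reaches prompt or threshold adjustments; the held-out slice is touched only when computing the reported accuracy.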

The evaluation methodology: code-verified truth, cross-validation, anti-contamination

An accuracy number is only worth as much as the methodology behind it. The eval should test the model against truth it could not see in advance, not flatter it. Three principles hold it together.

First, code-verified ground truth: anchor reference labels to verifiable artifacts - the actual call sites, the introducing change, the concrete failure behavior - so a label traces to evidence rather than vibes. Second, fault-injection cross-validation: check the model's predictions against the independent, experimentally observed outcomes. The two pipelines derive truth differently - one reasons over code, the other breaks the dependency and watches the SLO - so agreement earns trust and disagreement reveals either a model error or a coverage gap, both useful. Third, anti-contamination: because this is an LLM classifier and not a trained model, the contamination risk is in-context, not in a training set. The verified outcome must never be visible to the model at inference time - not in the prompt, not in the tool results, not as a few-shot example - or you measure look-up instead of reasoning.
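The anti-contamination rule is easy to state and easy to break by accident, so it helps to check it mechanically before each eval run. A minimal sketch, assuming verified outcomes can be recognised by markers such as experiment IDs; the marker format, part names, and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (caller, dependency)


def few_shot_overlap(eval_edges: Set[Edge], few_shot_edges: Set[Edge]) -> Set[Edge]:
    """Edges that appear both in the eval set and in the few-shot examples."""
    return eval_edges & few_shot_edges


def find_leaks(context_parts: Dict[str, str], forbidden_markers: List[str]) -> List[Tuple[str, str]]:
    """Return (part_name, marker) for every forbidden marker found in the model's context.

    context_parts maps a label such as "prompt" or "tool:fetch_source" to the
    text the model would see; forbidden_markers are strings that only occur
    alongside verified outcomes (assumed here to be experiment IDs).
    """
    leaks = []
    for part_name, text in context_parts.items():
        lowered = text.lower()
        for marker in forbidden_markers:
            if marker.lower() in lowered:
                leaks.append((part_name, marker))
    return leaks


# Made-up context: a tool result that accidentally quotes an experiment record.
context = {
    "prompt": "Classify the edge checkout -> payments as hard or soft.",
    "tool:fetch_source": "# see exp-0042 result: hard",
}
print(find_leaks(context, ["exp-0042"]))  # [('tool:fetch_source', 'exp-0042')]
```

A substring scan like this only catches leaks that carry a known marker; it is a guardrail, not a proof that the context is clean.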

Two smaller decisions matter for trust. Lowering the model's sampling temperature (roughly 1.0 down to 0.2) makes runs more repeatable, which you need when grading against a fixed label set: the same input should give the same label from run to run. It can also cut down on invented code paths, but it does not eliminate them and it does not guarantee identical outputs, so treat it as a reproducibility lever, not a hallucination cure. And the LLM component here is not a retrieval-with-embeddings system; it is a tool-using agent harness that does structured retrieval by calling purpose-built tools - build-graph queries, source fetches, blame and diff attribution, traces, service metadata - and reasoning over what they return. There are no vector embeddings and no semantic search in the path; retrieval is explicit, auditable tool calls, which is part of why a label can always be traced back to its evidence.
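Whether a given temperature is repeatable enough is something you can measure rather than assume. The sketch below runs the same input several times and reports how often the most common label appears; `classify` stands in for whatever callable wraps the model, and is a hypothetical placeholder.

```python
from collections import Counter
from typing import Callable, Tuple


def repeatability(classify: Callable[[str], str], edge_id: str, runs: int = 5) -> Tuple[str, float]:
    """Classify the same edge `runs` times; return (modal label, share of runs that agree)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    labels = [classify(edge_id) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / runs


# Stub classifier that flips once in five calls, to show the shape of the result.
_answers = iter(["hard", "hard", "soft", "hard", "hard"])
print(repeatability(lambda edge_id: next(_answers), "checkout->payments"))  # ('hard', 0.8)
```

An agreement share well below 1.0 on the fixed label set suggests the accuracy number will wobble between runs regardless of how good the prompt is.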

Why this is the right shape for a self-improving system

The lesson outlasts any one system. If you have an LLM making a judgment at scale and an expensive way to observe ground truth at small scale, do not make the model self-certify and do not try to label everything by hand. Build the cheap predictor and the expensive verifier as separate pipelines that derive truth by different means, then wire the verifier's disagreements back into the predictor. The disagreements are where the model is wrong in ways its own confidence can never reveal, and they are finite - you only need to verify enough to keep the boundary calibrated, not the whole fleet.

The payoff is that accuracy stops being a one-time launch metric and becomes a maintained property. Services change, dependencies are added, fallbacks rot. A classifier graded once and trusted forever silently decays. A classifier with a continuous fault-injection loop behind it gets corrected by reality on a steady cadence, so its labels stay good as the system moves underneath it. That is the difference between a model that looked accurate on launch day and one that stays accurate.
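Treating accuracy as a maintained property implies tracking it per period instead of once. A small sketch, assuming each verified check is stamped with the period in which the experiment ran; the month keys, the 0.9 floor, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Check = Tuple[str, str, str]  # (period, predicted label, observed label)


def accuracy_by_period(checks: Iterable[Check]) -> Dict[str, float]:
    """Accuracy of predictions against observed labels, grouped by period."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for period, predicted, observed in checks:
        totals[period] += 1
        hits[period] += int(predicted == observed)
    return {period: hits[period] / totals[period] for period in sorted(totals)}


def periods_below(accuracy: Dict[str, float], floor: float) -> List[str]:
    """Periods whose accuracy fell under the chosen floor."""
    return [period for period, value in accuracy.items() if value < floor]


# Made-up checks: the second month shows the classifier slipping.
checks = [
    ("2026-01", "hard", "hard"),
    ("2026-01", "soft", "soft"),
    ("2026-02", "soft", "hard"),
    ("2026-02", "hard", "hard"),
]
by_period = accuracy_by_period(checks)
print(by_period)                      # {'2026-01': 1.0, '2026-02': 0.5}
print(periods_below(by_period, 0.9))  # ['2026-02']
```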

Takeaways

Do not let an LLM self-certify a judgment that has an observable ground truth. Build a cheap predictor and an expensive verifier, and treat their disagreements as the highest-value signal.
Isolation is what makes fault injection a labeling engine: disable exactly one dependency, hold everything else constant, and the measured degradation is causally attributable to that edge - a real label, not an opinion.
An accuracy number is only as good as its eval. Code-verified ground truth, cross-validation against independently observed outcomes, and anti-contamination (verified labels kept out of the prompt, tools, and few-shot examples) are what make it defensible instead of self-graded.
Lowering sampling temperature is a reproducibility lever, not a hallucination cure - necessary when you are grading a model against a fixed label set.
Continuous ground truth turns accuracy from a launch-day snapshot into a maintained property: the loop corrects for drift through eval and tuning, with no model weights trained.