Dependency Explanations: LLM agents that earn trust
Context
The canonical dependency graph answers what depends on what. It can't answer the question engineers actually ask during reviews and incidents: why does this dependency exist, and is it safe to remove? Answering that manually means reading code, configs, and traffic samples — twenty minutes per edge, across hundreds of thousands of edges. It never happened.
Approach: tools over prompts
The naive version — stuff context into a prompt, ask a model to guess — produced fluent nonsense. What worked was inverting the design: a constrained agent loop where the model's job is to investigate, not recall.
- 16 purpose-built tools, each answering one narrow question well: who owns this service, what does the client config declare, what does sampled traffic show, where in code is the call site, what changed recently.
- Structured output — every run ends in an explanation plus a typed risk classification, not free text.
- Budgeted loops — the agent gets a bounded number of tool calls; ambiguity is surfaced as "low confidence," never papered over.
# the contract, simplified
explain(edge) -> {
    explanation: str,          # grounded in tool evidence, with citations
    classification: RiskTier,
    confidence: float,
    evidence: [ToolCall],      # auditable trail
}
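To make the contract and the budgeted loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the production harness: the `RiskTier` members, the `policy` callable standing in for the model, the `owner_of` tool, and the `max_calls` default are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class RiskTier(Enum):
    # Hypothetical tiers; the case study only names the type.
    LOW = "low"
    HIGH = "high"
    UNKNOWN = "unknown"


@dataclass
class ToolCall:
    tool: str
    args: dict
    result: str


@dataclass
class Explanation:
    explanation: str
    classification: RiskTier
    confidence: float
    evidence: list = field(default_factory=list)  # list of ToolCall


def explain(edge, policy, tools, max_calls=8):
    """Run a bounded investigate-then-answer loop for one edge.

    `policy` stands in for the model (an assumption): given the edge and the
    evidence so far, it returns either an Explanation or a (tool_name, args)
    request.
    """
    evidence = []
    for _ in range(max_calls):
        step = policy(edge, evidence)
        if isinstance(step, Explanation):
            step.evidence = list(evidence)  # attach the auditable trail
            return step
        name, args = step
        evidence.append(ToolCall(name, args, tools[name](**args)))
    # Budget exhausted: surface the ambiguity instead of guessing.
    return Explanation(
        explanation="Inconclusive: tool budget exhausted.",
        classification=RiskTier.UNKNOWN,
        confidence=0.0,
        evidence=evidence,
    )


def owner_of(service):
    # Hypothetical tool backed by a toy lookup table.
    return {"billing": "payments-team"}.get(service, "unknown")


def demo_policy(edge, evidence):
    # Ask for the owner once, then answer from that evidence.
    if not evidence:
        return ("owner_of", {"service": edge[1]})
    owner = evidence[-1].result
    return Explanation(f"{edge[0]} calls {edge[1]}, owned by {owner}.", RiskTier.LOW, 0.9)


result = explain(("checkout", "billing"), demo_policy, {"owner_of": owner_of})
```

The property worth noticing is that the loop, not the model, owns the budget. A policy that never converges still yields a typed result, with zero confidence and the full evidence trail attached.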
Evals were the product
The harness only became trustworthy when the eval suite did. We built a golden set of edges with expert-verified explanations, scored every change against it, and treated regressions like test failures. Prompt tweaks, tool changes, and model upgrades all gated on evals. The discipline is boring; it is also the entire reason coverage reached ~100% without a human reviewing every output.
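The gating idea can be sketched in a few lines of Python. This is a simplified illustration, not the real eval suite: it assumes scoring is an exact match on the risk tier (grading the explanation text itself is out of scope here), and the golden cases, service names, and `gate` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    edge: tuple           # (caller, callee)
    expected_tier: str    # expert-verified label


def score(explain_fn, golden):
    """Fraction of golden cases where the predicted tier matches the expert label."""
    hits = sum(1 for case in golden if explain_fn(case.edge) == case.expected_tier)
    return hits / len(golden)


def gate(candidate_score, baseline_score, tolerance=0.0):
    """Return True only if the candidate does not regress past the tolerance."""
    return candidate_score >= baseline_score - tolerance


# Hypothetical golden set; real cases would come from expert review.
GOLDEN = [
    GoldenCase(("checkout", "billing"), "high"),
    GoldenCase(("checkout", "recs"), "low"),
    GoldenCase(("search", "spellcheck"), "low"),
    GoldenCase(("orders", "inventory"), "high"),
]


def baseline(edge):
    # Stand-in for the currently shipped harness.
    return "high" if edge[1] in {"billing", "inventory"} else "low"


def candidate(edge):
    # Stand-in for a proposed change that misclassifies one edge.
    return "high" if edge[1] == "billing" else "low"


baseline_score = score(baseline, GOLDEN)
candidate_score = score(candidate, GOLDEN)
ships = gate(candidate_score, baseline_score)
```

Here the candidate misses the `inventory` edge, scores below the baseline, and is blocked, which is the same treatment a failing unit test would get.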
Getting to ~$0.15 per dependency pair
- Prompt caching for the static harness preamble and tool schemas.
- Model routing — a small model handles easy edges; escalation to a larger model only on low confidence.
- Evidence reuse — tool results are cached across edges that share a service.
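Two of these levers, model routing and evidence reuse, can be sketched together in Python. Prompt caching is a provider-side feature and is not shown. Everything named here is an assumption for illustration: the `small_model` and `large_model` stand-ins, the `fetch_owner` tool, the 0.7 threshold, and the `CALLS` counter.

```python
from functools import lru_cache

# Counters so the savings are visible; not part of any real harness.
CALLS = {"small": 0, "large": 0, "tool": 0}


@lru_cache(maxsize=None)
def fetch_owner(service):
    """Hypothetical tool call; cached so edges sharing a service reuse the result."""
    CALLS["tool"] += 1
    return f"team-{service}"


def small_model(edge):
    # Stand-in for the cheap model: unsure about edges into "legacy" services.
    CALLS["small"] += 1
    owner = fetch_owner(edge[1])
    confidence = 0.4 if edge[1].startswith("legacy") else 0.9
    return f"{edge[0]} -> {edge[1]} (owner {owner})", confidence


def large_model(edge):
    # Stand-in for the expensive model, used only on escalation.
    CALLS["large"] += 1
    owner = fetch_owner(edge[1])
    return f"{edge[0]} -> {edge[1]} (owner {owner}), examined in depth", 0.95


def explain_routed(edge, threshold=0.7):
    """Try the small model first; escalate only when confidence is below threshold."""
    text, confidence = small_model(edge)
    if confidence < threshold:
        text, confidence = large_model(edge)
    return text, confidence


edges = [("checkout", "billing"), ("orders", "billing"), ("orders", "legacy-ledger")]
results = [explain_routed(edge) for edge in edges]
```

Across these three edges the large model runs once and the tool is called twice instead of four times, because both `billing` edges share one cached lookup and the escalated run reuses the small model's evidence.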
Impact
- Near-100% of dependency edges carry a current, grounded explanation.
- Risk classifications feed the protection program and the fault-injection loop.
- Incident responders read the explanation instead of paging the owning team at 3am.
Lessons
Agents earn trust through tools and evals, not model choice. Every hour spent making a tool's output cleaner beat an hour of prompt engineering. And structured output with an auditable evidence trail is what let humans trust — and debug — the system.