Dependency Explanations: LLM agents that earn trust

Summary: An LLM agent harness over 16 purpose-built tools that investigates every service-dependency edge and writes a human-readable explanation with a risk classification. Near-100% coverage at ~$0.15 per service pair. Details simplified for public sharing.

Context

The canonical dependency graph answers what depends on what. It can't answer the question engineers actually ask during reviews and incidents: why does this dependency exist, and is it safe to remove? Answering that manually means reading code, configs, and traffic samples — twenty minutes per edge, across hundreds of thousands of edges. It never happened.

Approach: tools over prompts

The naive version — stuff context into a prompt, ask a model to guess — produced fluent nonsense. What worked was inverting the design: a constrained agent loop where the model's job is to investigate, not recall.

  • 16 purpose-built tools, each answering one narrow question well: who owns this service, what does the client config declare, what does sampled traffic show, where in code is the call site, what changed recently.
  • Structured output — every run ends in an explanation plus a typed risk classification, not free text.
  • Budgeted loops — the agent gets a bounded number of tool calls; ambiguity is surfaced as "low confidence," never papered over.
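# the contract, simplified
from dataclasses import dataclass

@dataclass
class Explanation:
    explanation: str            # grounded in tool evidence, with citations
    classification: "RiskTier"  # typed risk tier, not free text
    confidence: float           # low values surface ambiguity
    evidence: list["ToolCall"]  # auditable trail

def explain(edge: "Edge") -> Explanation: ...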
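
To make the budgeted loop concrete, here is a minimal sketch of how such a harness could be wired. It is not the production code: the `policy` callable stands in for the model, and `max_calls`, `low_confidence`, the `"unknown"` tier, and the tool names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    tool: str
    result: str


@dataclass
class Result:
    explanation: str
    classification: str  # stand-in for the typed RiskTier
    confidence: float
    evidence: list[ToolCall] = field(default_factory=list)


def explain(edge: str,
            tools: dict[str, Callable[[str], str]],
            policy: Callable[[str, list[ToolCall]], tuple],
            max_calls: int = 8,
            low_confidence: float = 0.3) -> Result:
    """Run a bounded investigate-then-answer loop.

    `policy` (hypothetical) returns ("call", tool_name) to gather more
    evidence, or ("finish", explanation, tier, confidence) to stop.
    """
    evidence: list[ToolCall] = []
    for _ in range(max_calls):
        action = policy(edge, evidence)
        if action[0] == "finish":
            _, text, tier, confidence = action
            return Result(text, tier, confidence, evidence)
        name = action[1]
        evidence.append(ToolCall(name, tools[name](edge)))
    # Budget exhausted: report the ambiguity instead of guessing.
    return Result("Inconclusive within the tool budget.", "unknown",
                  low_confidence, evidence)


# Made-up tools and a scripted policy, for illustration only.
TOOLS = {
    "owner": lambda edge: "owning team found",
    "call_site": lambda edge: "call site found",
}


def scripted_policy(edge, evidence):
    order = ["owner", "call_site"]
    if len(evidence) < len(order):
        return ("call", order[len(evidence)])
    return ("finish", f"{edge}: explained from {len(evidence)} tool results",
            "low", 0.9)
```

The point of the shape is that every exit path returns the same structured result, and the exhausted-budget path is explicit rather than an afterthought.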

Evals were the product

The harness only became trustworthy when the eval suite did. We built a golden set of edges with expert-verified explanations, scored every change against it, and treated regressions like test failures. Prompt tweaks, tool changes, and model upgrades all gated on evals. The discipline is boring; it is also the entire reason coverage reached ~100% without a human reviewing every output.
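
A regression gate of this kind can be very small. The sketch below is an assumption-heavy simplification: it scores only the risk tier against the golden label (scoring free-text explanations needs a richer grader), and the edges, tiers, and `tolerance` parameter are invented for the example.

```python
from typing import Callable


def score(golden: dict[str, str], predict: Callable[[str], str]) -> float:
    """Fraction of golden edges whose predicted tier matches the expert label."""
    hits = sum(1 for edge, expected in golden.items()
               if predict(edge) == expected)
    return hits / len(golden)


def gate(golden: dict[str, str],
         baseline: Callable[[str], str],
         candidate: Callable[[str], str],
         tolerance: float = 0.0) -> bool:
    """Pass only if the candidate scores at least as well as the baseline."""
    return score(golden, candidate) + tolerance >= score(golden, baseline)


# Invented golden set: edge -> expert-verified risk tier.
GOLDEN = {"a->b": "high", "a->c": "low", "b->c": "low"}


def baseline(edge: str) -> str:
    return "low"  # current harness: 2 of 3 correct


def regressed(edge: str) -> str:
    return "high"  # a change that makes things worse: 1 of 3 correct
```

Wired into CI, `gate(GOLDEN, baseline, candidate)` returning `False` blocks the change the same way a failing unit test would, whether the change is a prompt tweak, a tool change, or a model upgrade.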

Getting to ~$0.15 a pair

  • Prompt caching for the static harness preamble and tool schemas.
  • Model routing — a small model handles easy edges; escalation to a larger model only on low confidence.
  • Evidence reuse — tool results are cached across edges that share a service.
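
Two of these levers, routing and evidence reuse, can be sketched in a few lines. Everything named here is hypothetical: the `threshold` value, the `small` and `large` callables standing in for models, and the `owner_of` tool. Prompt caching is provider-specific and is left out.

```python
from functools import lru_cache
from typing import Callable

Model = Callable[[tuple[str, str]], tuple[str, float]]

LOOKUPS = {"owner_of": 0}  # counts real (uncached) tool executions


@lru_cache(maxsize=None)
def owner_of(service: str) -> str:
    """Hypothetical per-service tool; edges sharing a service reuse the result."""
    LOOKUPS["owner_of"] += 1
    return f"team-{service}"


def route(edge: tuple[str, str], small: Model, large: Model,
          threshold: float = 0.7) -> tuple[str, float, str]:
    """Try the small model first; escalate only when it reports low confidence."""
    answer, confidence = small(edge)
    if confidence >= threshold:
        return answer, confidence, "small"
    answer, confidence = large(edge)
    return answer, confidence, "large"


# Stand-in models with made-up confidences.
def small(edge):
    return ("easy edge", 0.9) if edge == ("a", "b") else ("unsure", 0.4)


def large(edge):
    return ("resolved", 0.95)
```

Here `route(("a", "b"), small, large)` stays on the small model, while `route(("a", "c"), small, large)` escalates. Calling `owner_of("a")` for both edges runs the underlying lookup once.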

Impact

  • Near-100% of dependency edges carry a current, grounded explanation.
  • Risk classifications feed the protection program and the fault-injection loop.
  • Incident responders read the explanation instead of paging the owning team at 3am.

Lessons

Agents earn trust through tools and evals, not model choice. Every hour spent making a tool's output cleaner beat an hour of prompt engineering. And structured output with an auditable evidence trail is what let humans trust — and debug — the system.