Dependency Explanations: LLM agents that earn trust

case study · 2025 - present · 9 min read

llm agents · evals · mcp · reliability

Summary - An LLM agent harness over 16 purpose-built tools that investigates every service-dependency edge and writes a human-readable explanation with a risk classification. Near-100% coverage at ~$0.15 per pair. Details simplified for public sharing.

Context

The canonical dependency graph answers what depends on what. It can't answer the question engineers actually ask during reviews and incidents: why does this dependency exist, and is it safe to remove? Answering that manually means reading code, configs, and traffic samples - twenty minutes per edge, across hundreds of thousands of edges. It never happened.

Approach: tools over prompts

The naive version - stuff context into a prompt, ask a model to guess - produced fluent nonsense. What worked was inverting the design: a constrained agent loop where the model's job is to investigate, not recall.

16 purpose-built tools, each answering one narrow question well: who owns this service, what does the client config declare, what does sampled traffic show, where in code is the call site, what changed recently.
Structured output - every run ends in an explanation plus a typed risk classification, not free text.
Budgeted loops - the agent gets a bounded number of tool calls; ambiguity is surfaced as "low confidence," never papered over.

# the contract, simplified
explain(edge) -> {
  explanation: str,        # grounded in tool evidence, with citations
  classification: RiskTier,
  confidence: float,
  evidence: [ToolCall],    # auditable trail
}
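A minimal sketch of how that contract and the budgeted loop could fit together, not the production harness. The tier names, the `policy` callable standing in for the model, the `run_agent` name, and the default budget of 8 calls are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class RiskTier(Enum):
    # Hypothetical tier names; the real tiers are not given in this write-up.
    LOW = "low"
    HIGH = "high"
    UNKNOWN = "unknown"


@dataclass
class ToolCall:
    tool: str
    args: dict
    result: str


@dataclass
class Explanation:
    explanation: str
    classification: RiskTier
    confidence: float
    evidence: list[ToolCall] = field(default_factory=list)


def run_agent(edge, policy, tools, max_tool_calls=8):
    """Run a bounded investigate-then-answer loop for one edge.

    `policy` stands in for the model. Given the edge and the evidence so
    far, it returns ("call", tool_name, args) or ("finish", Explanation).
    """
    evidence = []
    for _ in range(max_tool_calls):
        step = policy(edge, evidence)
        if step[0] == "finish":
            result = step[1]
            result.evidence = evidence  # attach the auditable trail
            return result
        _, name, args = step
        evidence.append(ToolCall(name, args, tools[name](**args)))
    # Budget exhausted: exit with explicit low confidence instead of guessing.
    return Explanation(
        explanation="Insufficient evidence within the tool budget.",
        classification=RiskTier.UNKNOWN,
        confidence=0.0,
        evidence=evidence,
    )
```

The point of the shape is that the loop, not the model, owns the exit condition: a run either finishes with evidence attached or stops at the cap and says so.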

Evals were the product

The harness only became trustworthy when the eval suite did. We built a golden set of edges with expert-verified explanations, scored every change against it, and treated regressions like test failures. Prompt tweaks, tool changes, and model upgrades all gated on evals. The discipline is boring; it is also the entire reason coverage reached ~100% without a human reviewing every output.
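As a rough illustration of gating changes on a golden set, the sketch below scores only the risk tier against expert labels. Grading the explanation text itself would need a rubric or a grader and is left out. `GoldenCase`, `score`, `gate`, and the `tolerance` parameter are hypothetical names, not the actual eval suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    edge: tuple[str, str]   # (caller, callee), hypothetical representation
    expected_tier: str      # expert-verified label


def score(explain, golden):
    """Fraction of golden edges whose predicted tier matches the expert label."""
    hits = sum(1 for case in golden if explain(case.edge) == case.expected_tier)
    return hits / len(golden)


def gate(candidate_score, baseline_score, tolerance=0.0):
    """Treat a regression like a test failure: pass only if the candidate
    is no worse than the baseline, within an optional tolerance."""
    return candidate_score >= baseline_score - tolerance
```

In CI this would run on every prompt tweak, tool change, or model upgrade, with the baseline score taken from the last accepted run.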

Getting to ~$0.15 a pair

Prompt caching for the static harness preamble and tool schemas.
Model routing - a small model handles easy edges; escalation to a larger model only on low confidence.
Evidence reuse - tool results are cached across edges that share a service.
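Two of these levers can be sketched in a few lines, under stated assumptions: each model is a callable returning `(answer, confidence)`, the escalation threshold of 0.7 is made up, and evidence is keyed by `(tool, service)`. Prompt caching is provider-specific and is not shown.

```python
def route(edge, small_model, large_model, threshold=0.7):
    """Try the small model first; escalate only when it reports low confidence."""
    answer, confidence = small_model(edge)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(edge)
    return answer, "large"


class EvidenceCache:
    """Reuse tool results across edges that share a service."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get(self, tool_name, service, fetch):
        key = (tool_name, service)
        if key not in self._store:
            # Only the first edge touching this service pays for the call.
            self.misses += 1
            self._store[key] = fetch(service)
        return self._store[key]
```

A cache like this needs an invalidation story (for example, per run or per deploy) so that reused evidence stays current; that part is omitted here.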

Decisions & tradeoffs

Tools over context-stuffing. Sixteen narrow tools were slower to build than one retrieval index, but every sentence in an explanation traces to a tool result. When the answer has to be auditable, evidence beats similarity.
Structured output over fluent prose. Typed risk tiers and citation fields cost the model expressiveness - and made the output consumable by the protection program and debuggable by humans. Prose would have been prettier, and useless.
Bounded loops over "let it think." A capped tool budget with an explicit low-confidence exit made cost predictable and uncertainty honest. Unbounded agent loops fail expensively and confidently.
Small model by default, escalate on doubt. Routing risks worse answers on edges that only look easy - the golden-set evals are what made that risk measurable instead of scary.

Impact

Near-100% of dependency edges carry a current, grounded explanation.
Risk classifications feed the protection program and the fault-injection loop.
Incident responders read the explanation instead of paging the owning team at 3am.

Lessons

Agents earn trust through tools and evals, not model choice. Every hour spent making a tool's output cleaner beat an hour of prompt engineering. And structured output with an auditable evidence trail is what let humans trust - and debug - the system.

What I'd do differently: build the golden set first. The earliest weeks tuned the harness against vibes, and every one of those "improvements" had to be re-litigated once real evals existed. I'd also record full tool-call traces from day one - debugging early runs without them was archaeology.
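Recording traces from day one can be as small as a wrapper around each tool. The sketch below is one way to do it, assuming tools are plain callables taking keyword arguments and that one JSON line per call is enough. The `traced` name and the record fields are hypothetical.

```python
import json
import time


def traced(name, tool, sink):
    """Wrap a tool so every call appends one JSON line to `sink`."""
    def wrapper(**kwargs):
        started = time.time()
        record = {"tool": name, "args": kwargs, "started": started}
        try:
            result = tool(**kwargs)
            record["result"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)  # failed calls are traced too
            raise
        finally:
            record["elapsed_s"] = time.time() - started
            sink.write(json.dumps(record, default=str) + "\n")
    return wrapper
```

`sink` can be any object with a `write` method, such as an open file or an `io.StringIO` in tests.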

Prabhav Nalhe