← notes

Agent Harness vs RAG: When Structured Tool-Calls Beat Vector Search

Two ways to feed a model context: retrieve fuzzy passages by similarity, or call purpose-built tools that return exact records. Here is how I decide which one a problem actually needs - and why, when an answer has to trace back to evidence, the tool-calling agent wins.

Prabhav Nalhe · 2026 · ~6 min read
flowchart TB
  subgraph S1["RAG: retrieve, then read"]
    direction TB
    Q["embed the query"] --> VS["vector search, top-k"] --> RD["LLM reads chunks"]
  end
  subgraph S2["agent: call tools, reason, repeat"]
    direction TB
    AG["LLM agent"] -->|"call"| TL["typed tools"]
    TL -->|"result"| AG
  end
  classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
  class AG key
  linkStyle 2 stroke:#2383E2,color:#2383E2
  linkStyle 3 stroke:#2383E2,color:#2383E2
RAG retrieves chunks by similarity and reads them. An agent harness calls explicit, typed tools in a loop and reasons over structured results, so every answer traces back to its evidence.

Two ways to give a model context

There are two broad ways to put the right information in front of a language model. The first, retrieval-augmented generation, chunks a corpus, embeds the chunks into vectors, and at query time pulls back the passages whose embeddings sit closest to the query in that space. The model reads those passages and answers. The second is a tool-using agent harness: instead of similarity search, you give the model a fixed set of purpose-built tools - query this graph, fetch this file, look up this record, run this attribution - and it decides which tool to call, reads what comes back, and reasons forward, calling again if it needs more.

The distinction that matters is not the model. It is where the retrieval boundary sits. In RAG, retrieval is a fuzzy nearest-neighbor lookup over text, decided by an embedding model you mostly do not control. In an agent harness, retrieval is a sequence of explicit, typed calls against systems that already know the answer exactly. Both put context in the window. They differ in whether that context arrives by resemblance or by lookup - and that single difference drives almost everything else.

What RAG is genuinely good at

It would be a mistake to treat RAG as the weaker option. For a large, unstructured, fuzzy corpus - support tickets, documentation, transcripts, years of prose where the user's words will never match the source's words - embedding search is the right tool, and a tool-call approach has nothing to offer because there is no structured system to call. You cannot write a typed query against ten thousand wiki pages of inconsistent terminology. Similarity search is exactly the mechanism that bridges 'how the user phrased it' and 'how the document phrased it.'

RAG also scales cheaply across breadth. One embedding index covers a whole corpus, and a query is a single approximate-nearest-neighbor search whose cost grows far slower than the number of documents behind it. When the question is 'find me the handful of passages most likely relevant to this loosely-worded question,' and approximate is good enough, RAG is hard to beat. The failure mode is not that it is bad - it is that 'most likely relevant' and 'approximate' are sometimes unacceptable, and that is where the comparison gets interesting.

When structured tool-calls win

Tool-calls win when the domain already has structure and the answer must be exact rather than plausible. If the truth lives in a dependency graph, a code repository, a metadata store, or a change history, you do not want a paraphrase of it retrieved by resemblance - you want the record. A tool that issues a typed query against that system returns the actual call sites, the actual diff that introduced a change, the actual service metadata. There is no embedding in the middle to blur an exact match into a near one, and no chunk boundary to slice a function in half.

Three properties follow, and they are the whole case. Precision: a graph query either matches or it does not, so you do not retrieve something that merely reads like the answer. Freshness: a tool can read the live system at call time, so its result reflects the system's current state, with no embedding index to rebuild and no staleness window between an update and a re-embedding. (This holds only as far as the backing system itself is live; a tool that reads a nightly snapshot is exactly as stale as that snapshot.) And traceability: because each tool call and its result are explicit, every claim in the final answer traces back to a concrete artifact you can point at - this file, this edge, this commit. The model's reasoning is auditable because its inputs are auditable. With pure similarity retrieval you can usually say a passage was 'near' the query in vector space; you often cannot say why, and you cannot guarantee the passage is the authoritative record rather than merely the nearest neighbor in vector space.

To ground this in my own practice: the retrieval I build is the tool-call kind, and I do not run vector embeddings or semantic search in that path. The model reasons over what structured tools return - graph queries, source fetches, blame and diff attribution, metadata lookups - not over embedded passages. I made that choice deliberately, for the reasons above: in this domain the answer has to trace to evidence, and tool-calls give me that trace.

Traceability is the real dividing line

If I had to compress the decision to one axis, it would be this: does the answer have to trace to evidence, or only sound right? Many tasks genuinely only need to sound right and helpful - a first-draft summary, a where-do-I-start pointer, a search over messy prose. Approximate retrieval serves those well, and over-engineering them into a typed tool layer is wasted effort.

But some answers are load-bearing. They get acted on - a label that decides whether a system gets hardened, a judgment that gates a change, a result a reviewer has to be able to challenge. For those, 'the model read some relevant-looking passages and produced this' is not a defensible provenance, because you cannot reconstruct why a specific passage was chosen or confirm it was authoritative. 'The model called this tool, got back this exact record, and reasoned from it' is defensible, because the chain is inspectable end to end. When a human or an audit can demand 'show me where this came from,' the tool-call path can answer with an artifact and the similarity path can only answer with a distance.

How to choose, and how to combine

In practice the choice is not ideological. Ask three questions of the domain. Is the source structured or is it a fuzzy text corpus - if it is genuinely unstructured prose, lean RAG, because there is nothing typed to call. Must the answer be exact and current, or is approximate-and-recent acceptable - if exactness or live freshness is required, lean tool-calls. And does the output need to trace to evidence - if a person will act on it or contest it, the auditability of explicit tool calls is worth its cost.

The two also compose, and the strongest systems use both with a clear seam. Embedding search is good at the first hop over a wide, messy space - narrowing from everything to a candidate set. Tool-calls are good at the second hop - taking a candidate and resolving it against the system of record to get the exact, current, traceable answer. Use similarity to find where to look, then use a typed tool to look. What you should not do is reach for vector search by default because it is the fashionable shape, when the domain is already structured and the answer has to be defensible. In a structured, evidence-bearing domain, a tool-calling harness is not a downgrade from RAG - it is the more auditable architecture, because every answer can be traced back to the exact tools it called and the records they returned.

Takeaways

  • The real difference between RAG and a tool-using agent is not the model - it is where retrieval happens: fuzzy nearest-neighbor over text versus explicit typed calls against a system that already knows the answer exactly.
  • RAG is the right tool for large, unstructured, fuzzy corpora where the user's words will never match the source's words; do not over-engineer those into a typed tool layer.
  • Tool-calls win when the domain is structured and the answer must be exact: precision (it matches or it does not), freshness (can read the live system, no embedding index to rebuild), and traceability (every claim points to a concrete artifact).
  • The dividing line is whether an answer must trace to evidence or only sound right. Load-bearing answers that get acted on or contested need the auditable provenance that explicit tool calls give and similarity retrieval cannot.
  • The two compose: use similarity search for the wide first hop to find where to look, then a typed tool for the second hop to resolve the exact, current, traceable record. Do not default to vector search in a domain that is already structured.
← more notes nprabhav111@gmail.com