
Designing APIs for Agents, Not Humans

When an LLM is the caller, the interface is the prompt. Typed responses, idempotent writes, granular composable tools, and machine-readable errors are not nice-to-haves - they are the difference between an agent that chains calls reliably and one that scrapes a human UI and guesses.

Prabhav Nalhe · 2026 · ~6 min read
[Figure: flowchart. agent -> typed tool, structured JSON (solid edge, labeled "reliable"); agent -> human CLI / UI, parse stdout (dashed edge, labeled "brittle").]
Design granular, typed tools that return structured results an agent can chain reliably, rather than making it scrape a human CLI or UI and parse free text.

The shift: the caller cannot read between the lines

For decades we designed APIs and CLIs for a human on the other end - someone who could read a paragraph of help text, eyeball a table, infer that "3 days ago" means a timestamp, and retry by hand when something looked off. An LLM agent has none of that slack. It does not see your interface; it sees a prompt, a tool schema, and a string of returned bytes, and it has to decide what to do next from that alone.

The failure I keep seeing is treating an agent like a fast human: point it at a CLI or a web UI built for people and let it scrape. It works in a demo and breaks quietly in production. The agent parses a table that gains a column, mistakes a banner for data, or reads "operation submitted" as success when the operation later failed. The fix is not a better model. It is an interface designed for a machine caller - one where the contract is explicit enough that chaining calls is a mechanical act, not an act of interpretation.

A useful test: could a caller with zero common sense, who takes every string literally and cannot ask a clarifying question, use this tool correctly? That caller is your agent.

Typed, structured responses over prose the model has to parse

The highest-leverage change is to return structured, typed data instead of human prose. An agent that gets back JSON with named fields - status as an enum, counts as integers, timestamps as ISO-8601, IDs as opaque strings - can branch on those fields directly. An agent that gets back "Found a few results, the newest from earlier today" has to re-derive meaning from English on every call, and it will be wrong some fraction of the time, silently.

The goal is to remove inference from the path. Every value the agent needs to make a decision should be a field it can read, not a phrase it has to interpret. Enumerate the states - do not let "pending," "in progress," and "running" be three strings that mean the same thing; pick one and document it in the schema. Keep shapes stable: a field that is sometimes a string and sometimes a list forces the model to handle both, and it will eventually handle neither. If a response can be empty, make empty an explicit, typed case (an empty list, a null with a reason) rather than a missing key the agent has to notice is absent.
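
A minimal sketch of what such a response could look like, in Python. The names here (JobState, JobSummary, JobListResponse, empty_reason) are hypothetical and only illustrate the shape: one canonical enum per state, integer counts, ISO-8601 timestamps, opaque string IDs, and an empty result that is still a list with a stated reason.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional
import json


class JobState(str, Enum):
    # One canonical string per state; no "pending" / "in progress" synonyms.
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class JobSummary:
    id: str             # opaque ID, passed back to other tools unchanged
    state: JobState     # enum, so the agent can branch on it directly
    attempt_count: int  # integer, never "a few"
    created_at: str     # ISO-8601 UTC timestamp


@dataclass(frozen=True)
class JobListResponse:
    # Always a list, even when empty, so the shape never changes.
    jobs: List[JobSummary] = field(default_factory=list)
    # Set only when jobs is empty; explains why instead of omitting a key.
    empty_reason: Optional[str] = None


def to_json(response: JobListResponse) -> str:
    """Serialize the response; enum members are emitted as their string values."""
    return json.dumps(asdict(response), sort_keys=True)
```

The empty case is the one worth checking: `to_json(JobListResponse(empty_reason="no_jobs_match_filter"))` still carries a `jobs` key holding an empty list, so the agent never has to notice an absence.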

This is also where retrieval should be honest about what it is. When I build retrieval for an agent, I expose it as explicit, typed tool calls - fetch this artifact by ID, query this graph, return these fields - not as a blob of fuzzy text the model has to skim. Auditable structured returns are what let you trace any decision back to the exact bytes that produced it.

Idempotency and machine-readable errors: designing for the retry

Agents retry. They retry because a call timed out, because the model decided to try again, because a higher-level planner re-ran a step. If your write endpoint is not idempotent, retries create duplicates - two payments, two tickets, two resources - and the agent has no way to know it just did damage. Give every state-changing operation an idempotency key the caller can supply, so a repeated call with the same key is a no-op that returns the original result. This single property converts "retries are dangerous" into "retries are safe," which is exactly the posture you want from an autonomous caller.
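
The idempotency-key behavior can be sketched with an in-memory store. TicketStore and create_ticket are hypothetical names, and this is an illustration under simplifying assumptions: a real service would persist the keys, expire them, guard against concurrent writers, and decide what to do when the same key arrives with a different payload.

```python
from dataclasses import dataclass
from typing import Dict
import itertools


@dataclass(frozen=True)
class Ticket:
    id: str
    title: str


class TicketStore:
    """In-memory stand-in for a state-changing endpoint (assumption, not a real API)."""

    def __init__(self) -> None:
        self._tickets: Dict[str, Ticket] = {}
        self._by_key: Dict[str, Ticket] = {}
        self._ids = itertools.count(1)

    def create_ticket(self, title: str, idempotency_key: str) -> Ticket:
        # A key we have already seen: return the original result, write nothing.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        ticket = Ticket(id=f"tkt_{next(self._ids)}", title=title)
        self._tickets[ticket.id] = ticket
        self._by_key[idempotency_key] = ticket
        return ticket

    def count(self) -> int:
        """Number of tickets actually created."""
        return len(self._tickets)
```

Calling `create_ticket` twice with the same key returns the same Ticket and leaves `count()` at one; only a new key creates a second ticket.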

Errors deserve the same rigor as success. A human reads "something went wrong, please try again later" and uses judgment. An agent needs a stable, machine-readable error contract: a typed error code it can switch on, a category telling it whether the error is retryable and why, and where relevant a structured hint about what to fix. "Invalid argument: field 'region' must be one of [...]" lets the agent correct itself on the next call. A 500 with a stack trace in HTML teaches it nothing. Distinguish the classes that demand different behavior - retryable transient failures, permanent bad-input failures, and authorization failures - because the right next action differs for each, and the agent can only pick correctly if the error tells it which class it is in.
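
One way to sketch that error contract. The category names, the ToolError fields, and the next_action strings are assumptions chosen for illustration, not a standard; the point is only that each class of failure maps to a different next step without the agent reading English.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ErrorCategory(str, Enum):
    TRANSIENT = "transient"          # safe to retry, ideally with backoff
    INVALID_INPUT = "invalid_input"  # retrying unchanged will fail again
    UNAUTHORIZED = "unauthorized"    # the agent cannot fix this on its own


@dataclass(frozen=True)
class ToolError:
    code: str                       # stable code the caller can switch on
    category: ErrorCategory
    message: str                    # for logs and humans, not for branching
    field_name: Optional[str] = None              # which argument to fix
    allowed_values: Optional[Tuple[str, ...]] = None  # structured fix hint

    @property
    def retryable(self) -> bool:
        # Only transient failures are worth repeating unchanged.
        return self.category is ErrorCategory.TRANSIENT


def next_action(error: ToolError) -> str:
    """Map an error class to the behavior it demands from the caller."""
    if error.category is ErrorCategory.TRANSIENT:
        return "retry_with_backoff"
    if error.category is ErrorCategory.INVALID_INPUT:
        return "fix_arguments_and_retry"
    return "stop_and_escalate"
```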

Granular, composable tools beat one do-everything endpoint

There is a temptation to build one powerful endpoint that takes a big options object and does the whole job, because that is convenient for a human integrator who reads the docs once. For an agent it is the wrong shape. A monolithic tool with twenty optional parameters is one the model is far more likely to fill in wrong, because it has to commit to every choice up front with no chance to inspect intermediate state and adjust.

Granular, composable tools work better: a tool to list, a tool to fetch one thing by ID, a tool to perform one action. The agent composes them - list, inspect a result, decide, act - and the structured output of each step informs the next. Each tool should do one thing, with a name and a description written for the model, not lifted from internal jargon. Keep the parameter surface small and typed; prefer required arguments with clear types over a pile of optional flags whose interactions the model has to reason about. And design tools so their outputs feed each other: if your list tool returns IDs in exactly the form your fetch tool expects, chaining is trivial; if it returns a display string the agent has to massage back into an ID, you have planted a parsing bug.
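
The list, inspect, decide, act loop can be sketched with three small tools over a toy in-memory table. All names and data here (list_server_ids, get_server, resume_server, the srv_ IDs) are hypothetical. In practice the model does the composing; resume_suspended just stands in for that to show that IDs pass between tools unchanged.

```python
from typing import Dict, List

# Toy data standing in for a real backend.
_SERVERS: Dict[str, Dict[str, str]] = {
    "srv_a1": {"id": "srv_a1", "state": "active", "region": "eu-west"},
    "srv_b2": {"id": "srv_b2", "state": "suspended", "region": "eu-west"},
    "srv_c3": {"id": "srv_c3", "state": "suspended", "region": "us-east"},
}


def list_server_ids(region: str) -> List[str]:
    """List tool: returns IDs in exactly the form get_server expects."""
    return sorted(sid for sid, s in _SERVERS.items() if s["region"] == region)


def get_server(server_id: str) -> Dict[str, str]:
    """Fetch tool: one server by ID, returned as a copy."""
    return dict(_SERVERS[server_id])


def resume_server(server_id: str) -> Dict[str, str]:
    """Action tool: does one thing, returns the new state."""
    _SERVERS[server_id]["state"] = "active"
    return dict(_SERVERS[server_id])


def resume_suspended(region: str) -> List[str]:
    """What an agent might chain: list, inspect each result, decide, act."""
    resumed = []
    for server_id in list_server_ids(region):
        if get_server(server_id)["state"] == "suspended":
            resume_server(server_id)
            resumed.append(server_id)
    return resumed
```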

There is a real tension here with cost. More granular tools mean more round trips, and each round trip is a model call with latency and tokens. The answer is not one giant tool; it is well-chosen composition - tools granular enough to be reliable, coarse enough that common tasks do not take fifteen hops - and tool results scoped to what the next decision needs rather than dumping everything and making the model pay to read it.
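
Scoping a result to what the next decision needs can be as simple as letting the caller name the fields it wants. This is a sketch of one possible approach, not the only one; project and the record contents are hypothetical.

```python
from typing import Any, Dict, Iterable


def project(record: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return only the requested fields, rejecting names that do not exist."""
    wanted = list(fields)
    unknown = [name for name in wanted if name not in record]
    if unknown:
        # Fail loudly so a misspelled field is not silently dropped.
        raise ValueError(f"unknown fields: {unknown}")
    return {name: record[name] for name in wanted}
```

An agent deciding whether to resume a server needs `id` and `state`, not the full audit history, so the tool can return two fields instead of making the model pay to read everything.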

Affordances: the schema is the prompt

For an agent, the tool definition is not documentation that sits beside the system - it is part of the prompt, and it is doing the teaching. This reframes API design as prompt design. The names, descriptions, field comments, and examples in your schema are the model's only guide to using the tool correctly, so write them for that reader. A description that says "returns the resource" helps no one; "returns the resource by ID; raises not_found if the ID does not exist; the returned 'state' field is one of active, suspended, deleted" tells the model exactly how to use the result and handle the edges.
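
To make that concrete, here is a sketch of a tool definition written as a JSON-Schema-style dictionary. The envelope keys (name, description, input_schema) are an assumption for illustration, since the exact wrapper varies by provider, and get_resource and resource_id are hypothetical names. The description text is the useful version from the paragraph above.

```python
import json

GET_RESOURCE_TOOL = {
    "name": "get_resource",
    # The description is what the model reads, so it carries the contract:
    # what comes back, what error to expect, and the legal 'state' values.
    "description": (
        "Returns the resource by ID. Raises not_found if the ID does not "
        "exist. The returned 'state' field is one of: active, suspended, "
        "deleted."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "resource_id": {
                "type": "string",
                "description": "Opaque ID, exactly as returned by the list tool.",
            }
        },
        # One required argument, no optional flags to reason about.
        "required": ["resource_id"],
        "additionalProperties": False,
    },
}


def render_for_prompt(tool: dict) -> str:
    """Serialize the definition the way it would be handed to the model."""
    return json.dumps(tool, indent=2)
```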

Build affordances that keep the agent on rails. Make the correct path the obvious one and the dangerous path require an explicit, named argument. Validate inputs strictly and return a typed error that names the fix, so a wrong call becomes a self-correcting call rather than a silent bad write. Where an operation is irreversible, say so in the description and consider requiring a confirmation token, because an agent will not hesitate the way a human would. The aim is an interface where the model almost cannot hold it wrong - where the schema itself nudges most calls toward a well-formed shape.
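
A sketch of both affordances on one irreversible operation: strict validation that names the fix, and an explicit confirmation argument. delete_volume, VALID_REGIONS, and the token format are hypothetical. In a real system the token would more likely be issued by the server and be short-lived; here it is derived from the ID only to keep the example self-contained.

```python
from typing import Dict, Optional

VALID_REGIONS = ("eu-west", "us-east")  # hypothetical allowed values


def delete_volume(
    volume_id: str, region: str, confirm_token: Optional[str] = None
) -> Dict[str, object]:
    """Irreversible delete that refuses to run without explicit confirmation."""
    if region not in VALID_REGIONS:
        # Typed error that names the field and the legal values.
        return {
            "ok": False,
            "error": {
                "code": "invalid_argument",
                "field": "region",
                "allowed_values": list(VALID_REGIONS),
            },
        }
    expected = f"delete:{volume_id}"
    if confirm_token != expected:
        # The dangerous path requires a named argument; nothing is deleted.
        return {
            "ok": False,
            "error": {
                "code": "confirmation_required",
                "field": "confirm_token",
                "hint": f"irreversible; re-send with confirm_token='{expected}'",
            },
        }
    return {"ok": True, "deleted_id": volume_id}
```

A first call without the token comes back as `confirmation_required` and changes nothing, which gives a planner or a human reviewer a point at which to stop.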

None of this requires a smarter model. It requires moving the work from inference to contract: say what you mean in types the caller cannot misread, make actions safe to repeat, fail in a language the caller can act on, and let small tools compose. Do that and the same agent that flailed against a human UI starts chaining calls reliably - because you finally built the interface for the caller you actually have.

Takeaways

  • Design for the caller you have. An LLM does not read between the lines - it acts on a schema and returned bytes. The test for any agent-facing tool: could a caller who takes every string literally and cannot ask a question use this correctly?
  • Return typed, structured data, not prose. Enumerate states, keep field shapes stable, and make empty an explicit typed case. Every value the agent branches on should be a field it reads, not a phrase it interprets.
  • Make writes idempotent and errors machine-readable. An idempotency key turns dangerous retries into safe ones; a typed error code with a retryable category and a fix hint turns a failed call into a self-correcting one.
  • Prefer granular, composable tools over one do-everything endpoint, but watch the cost: each tool call is a round trip through the model, paid for in latency and tokens. Aim for tools granular enough to be reliable and coarse enough that common tasks do not take fifteen hops, with outputs that feed the next call cleanly.
  • For an agent, the tool schema is the prompt. Names, descriptions, and field comments are doing the teaching - write them for the model, make the safe path the obvious one, and the model almost cannot hold the tool wrong.