Evals Before Features: Benchmarking LLMs for a Production Task
Before you wire an LLM into a real workflow, decide how you will know it is good enough - because the eval is the gate, and the model is just the thing that has to pass it.
Figure: model output -> eval gate (dimensions + held-out) -> ship on pass; on fail -> fix prompt, which loops back to model output.
Start with the eval, not the model
The default move when adding an LLM to a product is to pick a model, write a prompt, look at a few outputs, and ship if they look right. That works until the day a slightly different input produces a confidently wrong answer and you have no way to say whether it was a fluke or a pattern. The failure is not the model. It is that there was never a measurement the model had to pass before it was trusted.
The better order is to build the eval first. Before any feature work, write down what "good" means for this specific task, assemble a set of inputs with known correct answers, and decide the score threshold that earns the model a place in the product. Then the model selection, the prompt, and every later change all answer to the same gate. An eval built this way is cheap to run, so you run it on every prompt edit and every model version, and a regression shows up as a number going down rather than as a support ticket weeks later.
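As a rough illustration of the gate described above, here is a minimal sketch in Python. The names `Example`, `run_gate`, and `toy_model` are made up for this example, exact string match stands in for whatever scoring your task really needs, and the 0.9 threshold is an arbitrary placeholder rather than a recommendation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    input_text: str
    expected: str  # the known correct answer for this input


def run_gate(
    model: Callable[[str], str],
    examples: list[Example],
    threshold: float,
) -> tuple[float, bool]:
    """Score a model on the labeled set and compare the score to the threshold."""
    if not examples:
        raise ValueError("the labeled set is empty")
    correct = sum(
        1 for ex in examples if model(ex.input_text).strip() == ex.expected
    )
    score = correct / len(examples)
    return score, score >= threshold


# Hypothetical stand-in for a real model call: it upper-cases its input.
def toy_model(text: str) -> str:
    return text.upper()


# Illustrative labeled set; the third item is one the toy model gets wrong.
EXAMPLES = [
    Example("ab", "AB"),
    Example("cd", "CD"),
    Example("ef", "Ef"),
    Example("gh", "GH"),
]

score, passed = run_gate(toy_model, EXAMPLES, threshold=0.9)
print(f"score={score:.2f} passed={passed}")
```

Because the gate is a plain function of a model callable and a labeled set, the same call can be repeated for every prompt edit or model version without changing the bar.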
Define the dimensions before you measure anything
"Accuracy" is rarely one number for a real task. A model that gets the right answer but in the wrong format breaks the code that consumes it. A model that is right most of the time but catastrophically wrong on a small tail may be unusable if those tail cases are the expensive ones. So the first step is to decompose quality into dimensions you can score independently.
For most production tasks the dimensions are some mix of: correctness against a known answer, format and schema validity so the output parses every time, calibration so the model abstains or flags low confidence instead of guessing, latency and cost per call, and behavior on adversarial or out-of-distribution inputs. Score each separately. A single blended number hides exactly the tradeoff you most need to see - the model that wins on average correctness but loses on tail behavior, or the one that is slightly worse but half the cost and latency. Different tasks weight these differently, and writing the weights down is itself a design decision worth making on purpose.
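One way to keep the dimensions separate is to score each output into a small record and only aggregate per dimension. The sketch below assumes a task whose output is a JSON object with an `answer` field and an optional `abstain` flag; that schema, and the names `DimensionScores`, `score_output`, and `summarize`, are assumptions made for illustration. Abstention rate is used here only as a crude proxy for calibration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScores:
    correct: bool
    valid_format: bool
    abstained: bool
    latency_ms: float


def score_output(raw: str, expected: str, latency_ms: float) -> DimensionScores:
    """Score one raw output on each dimension independently."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return DimensionScores(False, False, False, latency_ms)
    # Valid JSON that is not the assumed object shape still fails the format check.
    if not isinstance(parsed, dict) or "answer" not in parsed:
        return DimensionScores(False, False, False, latency_ms)
    abstained = parsed.get("abstain") is True
    correct = (not abstained) and parsed["answer"] == expected
    return DimensionScores(correct, True, abstained, latency_ms)


def summarize(rows: list[DimensionScores]) -> dict[str, float]:
    """Aggregate per dimension; deliberately no single blended number."""
    n = len(rows)
    return {
        "correctness": sum(r.correct for r in rows) / n,
        "format_validity": sum(r.valid_format for r in rows) / n,
        "abstention_rate": sum(r.abstained for r in rows) / n,
        "mean_latency_ms": sum(r.latency_ms for r in rows) / n,
    }


# Illustrative outputs: right answer, right answer in the wrong format,
# an abstention, and a well-formed wrong answer.
rows = [
    score_output('{"answer": "42"}', "42", 120.0),
    score_output("42", "42", 80.0),
    score_output('{"answer": null, "abstain": true}', "7", 100.0),
    score_output('{"answer": "9"}', "8", 100.0),
]
summary = summarize(rows)
print(summary)
```

The second row is the case the text warns about: the answer is right, but the consumer cannot parse it, so it counts against format validity and correctness rather than being hidden inside an average.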
Build a labeled set you actually trust, and keep it clean
The eval is only as good as its answer key. A labeled set assembled from whatever was easy to find will quietly encode the same blind spots the model has, and then agreement means nothing. So anchor labels to verifiable evidence rather than to someone's quick judgment - tie each expected answer to a concrete artifact you can point to, so a disputed label can be re-derived from the source instead of re-argued from memory.
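A small sketch of what "anchored to evidence" could look like as a data structure: each labeled item carries a pointer to the artifact its answer was derived from, and a check flags items that lack one. The `LabeledItem` fields, the `unanchored` helper, and the sample items are all invented for illustration; the form of the evidence pointer would depend on where your source artifacts live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledItem:
    item_id: str
    input_text: str
    expected: str
    evidence_ref: str  # hypothetical pointer, e.g. a document section or record key


def unanchored(items: list[LabeledItem]) -> list[str]:
    """Return the IDs of items whose label has no evidence pointer."""
    return [it.item_id for it in items if not it.evidence_ref.strip()]


# Made-up items: the second label cannot be re-derived from any source.
ITEMS = [
    LabeledItem("a1", "refund request, order shipped", "deny", "policy-doc#section-3"),
    LabeledItem("a2", "refund request, order lost", "approve", ""),
]

missing = unanchored(ITEMS)
print(missing)
```

Running a check like this before scoring keeps labels that rest only on someone's quick judgment out of the answer key until they have a source.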
The harder discipline is anti-contamination. If the items you grade against have leaked into what the model can see, you measure recall of the answer, not the reasoning that would produce it. For a hosted model you cannot inspect a training set, so the contamination you can actually control is in-context: the expected answer must never appear in the prompt, in the results of any tool the model calls, or as a few-shot example. Keep a held-out slice that is used only for scoring and never for prompt tuning, so the number you report is measured against inputs the current prompt was not fit to. The moment your tuning loop starts optimizing against the held-out slice, it stops being a held-out slice.
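Both controls can be sketched in a few lines of Python. The split below hashes a stable item ID so membership in the held-out slice never changes between runs, and the leak check does a plain case-insensitive substring search over every piece of text the model will see. The function names and the 20% fraction are assumptions for illustration, and a substring check is only a first line of defense: it will over-flag short answers such as "yes" and miss paraphrased leaks.

```python
import hashlib


def is_held_out(item_id: str, held_out_fraction: float = 0.2) -> bool:
    """Deterministically assign an item to the held-out slice by hashing its ID."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < held_out_fraction


def leaks_answer(expected: str, context_parts: list[str]) -> bool:
    """True if the expected answer appears verbatim in any text the model sees.

    context_parts should include the prompt, few-shot examples, and tool results.
    """
    needle = expected.strip().lower()
    if not needle:
        return False
    return any(needle in part.lower() for part in context_parts)


# Illustrative check: the tool result below would hand the model the answer.
clean = leaks_answer("INV-2291", ["Find the invoice number for this order."])
leaky = leaks_answer(
    "INV-2291",
    ["Find the invoice number for this order.", "tool result: invoice inv-2291 paid"],
)
print(clean, leaky)
```

Hashing the ID rather than sampling at random means adding new items never reshuffles existing ones, so an item that was used for prompt tuning cannot drift into the held-out slice later.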
Run a cross-vendor bake-off on the same harness
Once the eval exists, comparing models becomes mechanical instead of anecdotal. Run every candidate - across vendors and across sizes within a vendor - through the identical harness: the same inputs, the same scoring, the same parsing of outputs. Vendor benchmarks are run on the vendor's tasks; yours is run on your task, which is the only one that matters for your decision.
The harness itself needs to be reproducible, which means pinning model versions and lowering sampling temperature for the graded runs so scores are far more stable from run to run. Lower temperature is a reproducibility lever, not a correctness fix - it reduces run-to-run variance, but it does not make a wrong model right. The bake-off output is a table, not a winner: this model leads on correctness but costs more and runs slower, that one is close on quality at a fraction of the latency, a third is strong on the common cases and weak on the adversarial slice. That table is what lets you choose deliberately, and re-choose later when a new model ships, because the gate has not moved even though the candidates have.
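A bake-off harness along these lines might look like the sketch below. Every candidate is wrapped as a callable with the same signature, run over the same examples at the same temperature, and reported as one row of a table. The `Candidate` type, the model labels, the cost figures, and the two lookup-table "models" are placeholders invented for this example; a real harness would wrap each vendor's client behind the same `(prompt, temperature) -> str` interface.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    name: str                          # pinned version label (hypothetical)
    call: Callable[[str, float], str]  # (prompt, temperature) -> raw output
    cost_per_call: float               # illustrative units


def bake_off(
    candidates: list[Candidate],
    examples: list[tuple[str, str]],
    temperature: float = 0.0,
) -> list[dict]:
    """Run every candidate over identical inputs with identical scoring."""
    rows = []
    for cand in candidates:
        correct = 0
        started = time.perf_counter()
        for prompt, expected in examples:
            if cand.call(prompt, temperature).strip() == expected:
                correct += 1
        elapsed_ms = (time.perf_counter() - started) * 1000
        rows.append({
            "model": cand.name,
            "correctness": correct / len(examples),
            "mean_latency_ms": elapsed_ms / len(examples),
            "cost_per_call": cand.cost_per_call,
        })
    return rows


# Toy stand-ins for real model clients; they ignore temperature.
def model_a(prompt: str, temperature: float) -> str:
    return {"2+2": "4", "3+3": "6", "10-7": "3"}.get(prompt, "")


def model_b(prompt: str, temperature: float) -> str:
    return {"2+2": "4", "3+3": "6"}.get(prompt, "unknown")


EXAMPLES = [("2+2", "4"), ("3+3", "6"), ("10-7", "3")]
table = bake_off(
    [Candidate("model-a-v1", model_a, 0.010), Candidate("model-b-v3", model_b, 0.002)],
    EXAMPLES,
)
for row in table:
    print(f"{row['model']:<12} correctness={row['correctness']:.2f} "
          f"cost={row['cost_per_call']}")
```

The function returns rows rather than picking a winner, which keeps the tradeoff between correctness, latency, and cost visible to whoever makes the decision.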
Ship behind a router, and keep the eval as a standing gate
The bake-off rarely produces one model that dominates on every dimension, and that is fine, because you do not have to pick one for everything. Put the chosen models behind a router that sends each request to the model that wins on the dimensions that request cares about - a cheap fast model for the common, low-stakes path, an expensive careful model for the inputs the eval flagged as hard or high-cost-of-error. The router is a policy you can tune, and the eval is what tells you whether a routing change helped or hurt.
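The routing policy can start as a very small function. In the sketch below, the model labels, the `Request` fields, and the length-based `looks_hard` heuristic are all hypothetical placeholders; in practice the "hard" signal would come from whichever input slices the eval showed the cheap model failing on.

```python
from dataclasses import dataclass

# Hypothetical labels for the two models chosen from the bake-off.
CHEAP_MODEL = "cheap-fast-model"
CAREFUL_MODEL = "expensive-careful-model"


@dataclass(frozen=True)
class Request:
    text: str
    high_stakes: bool  # set by the caller, e.g. for actions that are costly to get wrong


def looks_hard(text: str, max_easy_chars: int = 200) -> bool:
    """Placeholder difficulty heuristic: long inputs are treated as hard."""
    return len(text) > max_easy_chars


def route(req: Request) -> str:
    """Send high-stakes or hard-looking requests to the careful model."""
    if req.high_stakes or looks_hard(req.text):
        return CAREFUL_MODEL
    return CHEAP_MODEL


print(route(Request("What are your opening hours?", high_stakes=False)))
print(route(Request("Cancel my contract.", high_stakes=True)))
```

Because the policy is an ordinary function, a routing change can be scored like any other change: run the labeled set through `route` plus the selected model and compare the per-dimension numbers before and after.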
The last and most important habit is to keep the eval running after launch. Models get deprecated and silently replaced, prompts get edited, and the input distribution drifts as the product changes underneath you. A model that passed the gate once and is then trusted forever can slip below the bar without anyone noticing. Wire the eval into the release path so no prompt or model change ships without clearing the same bar, and re-score on a cadence against fresh labeled inputs. Treated this way, quality stops being a launch-day screenshot and becomes a maintained property - one that stays good as the model, prompts, and inputs change underneath it.
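As a sketch of what a standing release gate could check, the function below compares a candidate's per-dimension scores against fixed floors and against the last accepted baseline, and returns the reasons a change should be blocked. The function name, the dimension names, and every number here are illustrative assumptions, not measured results.

```python
def check_release(
    baseline: dict[str, float],
    candidate: dict[str, float],
    floors: dict[str, float],
    max_drop: float = 0.01,
) -> list[str]:
    """Return human-readable failures; an empty list means the change clears the gate."""
    failures = []
    for dim, floor in floors.items():
        value = candidate[dim]
        if value < floor:
            failures.append(f"{dim}: {value:.3f} is below the floor {floor:.3f}")
        elif baseline[dim] - value > max_drop:
            failures.append(f"{dim}: dropped from {baseline[dim]:.3f} to {value:.3f}")
    return failures


# Illustrative scores for the last accepted release and a proposed prompt edit.
baseline = {"correctness": 0.92, "format_validity": 1.0}
candidate = {"correctness": 0.88, "format_validity": 1.0}
floors = {"correctness": 0.85, "format_validity": 0.99}

failures = check_release(baseline, candidate, floors)
for line in failures:
    print("BLOCKED:", line)
```

Checking for a drop relative to the baseline, and not only against the floor, is one way to surface a regression while the score is still above the minimum. In a release pipeline, a non-empty list would typically fail the build.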
Takeaways
- Build the eval before the feature. Define what good means, assemble a trusted labeled set, and pick a threshold - then make model choice, prompts, and every later change answer to that one gate.
- Decompose quality into independent dimensions - correctness, format validity, calibration, latency, cost, tail and adversarial behavior - because a single blended number hides the exact tradeoff you need to see.
- Your labeled set is the answer key: anchor labels to verifiable evidence, and enforce anti-contamination by keeping expected answers out of the prompt, tool results, and few-shot examples, with a held-out slice used only for scoring.
- Run a cross-vendor bake-off on one reproducible harness with pinned versions and low temperature. The output is a tradeoff table, not a single winner - lower temperature buys reproducibility, not correctness.
- Ship behind a router that matches each request to the model that wins on the dimensions it cares about, and keep the eval running as a release gate so quality survives model deprecation, prompt edits, and input drift.