LLM as a Judge: Quality Gates That Are Not Vibes

2026·6 min read

evals · llm · quality gates

A separate model can score outputs you cannot label by hand - but only if you treat it like a measuring instrument: a few binary dimensions, a calibrated reading, an answer key it never gets to see.

flowchart LR
  G["generator
makes output"] --> J["judge
separate model"]
  J -->|"pass"| SH["ship"]
  J -->|"fail"| BL["block / retry"]
  classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
  classDef good fill:#eaf6ec,stroke:#1e1e1e,color:#1e1e1e
  classDef hot fill:#fde7e7,stroke:#1e1e1e,color:#1e1e1e
  class J key
  class SH good
  class BL hot
  linkStyle 1 stroke:#2f9e44,color:#2f9e44
  linkStyle 2 stroke:#c92a2a,color:#c92a2a

A separate judge (never the generator) scores a few binary dimensions and gates delivery. It is calibrated against human labels and never sees the answer key, so it measures judgment, not lookup.

Why grade with a model at all

Some outputs have a cheap, exact check - a schema that parses, a number that matches, a test that passes - and for those you should never reach for a model. You reach for a judge when correctness is real but not mechanically checkable: did this summary stay faithful to its source, did this answer actually address the question, did this explanation avoid asserting a fact the inputs do not support. Human review can score these, but it does not scale to every output on every prompt change, so most teams fall back to reading a handful of samples and shipping on a good feeling. That is the failure mode worth naming. "Looks fine" is not a measurement, it does not run in CI, and it cannot tell you whether yesterday's prompt edit quietly made things worse. A second model, prompted only to score, turns that vibe into a number you can gate on and track over time. The judge is not there to be smart. It is there to be a consistent, cheap, repeatable instrument - one that gives you the same reading on the same input far more often than a human re-reading samples will.

Score a few binary dimensions, never let the model grade itself

The first rule is structural: the call that produced the output must not be the call that grades it. Ideally the judge is a different model; at minimum it is a clean prompt with no memory of how the answer was generated. Self-grading measures a model's confidence in its own work, which is exactly the thing that is already broken when the work is wrong. A separate judge call makes the score reflect the artifact, not the author's self-regard. The second rule is to keep the rubric small and binary. A judge asked to rate quality from one to ten will hand back noise dressed as precision, and the same output can land on different scores from one run to the next. Decompose into a few yes-or-no dimensions instead: is every claim supported by the provided source material, does it answer the question that was actually asked, does it avoid contradicting the inputs, is the format valid. Each is a binary call a judge can make consistently and a human can adjudicate when you check it. Three or four sharp binary dimensions beat one fuzzy scalar, because you can see which specific property failed rather than watching an aggregate sag for reasons no one can name.
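One way the small binary rubric could look in code is sketched below. The dimension names, the prompt wording, and the helpers `build_judge_prompt` and `parse_verdict` are all illustrative assumptions, not a fixed API; the model call itself is left out, so the sketch covers only what goes into the judge and how its reply is checked.

```python
import json
from dataclasses import dataclass

# Hypothetical dimension names, taken from the examples in the text above.
DIMENSIONS = (
    "supported_by_source",
    "answers_question",
    "no_contradiction",
    "valid_format",
)

# Assumed prompt wording: the judge is asked only to score, never to rewrite.
JUDGE_PROMPT = """You are grading an output. Do not rewrite it.
For each dimension answer true or false, nothing in between.
Return a JSON object with exactly these boolean keys: {keys}

SOURCE:
{source}

QUESTION:
{question}

OUTPUT:
{output}
"""


@dataclass(frozen=True)
class Verdict:
    scores: dict  # dimension name -> bool

    @property
    def failed(self):
        # The specific dimensions that failed, in rubric order.
        return [d for d in DIMENSIONS if not self.scores[d]]

    @property
    def passed(self):
        return not self.failed


def build_judge_prompt(source, question, output):
    """Fill the judge prompt with the artifact and its inputs only."""
    return JUDGE_PROMPT.format(
        keys=", ".join(DIMENSIONS),
        source=source,
        question=question,
        output=output,
    )


def parse_verdict(raw):
    """Parse the judge's JSON reply; reject anything that is not strictly boolean."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        if not isinstance(value, bool):
            raise ValueError(f"dimension {dim!r} must be true or false, got {value!r}")
        scores[dim] = value
    return Verdict(scores)
```

Rejecting any non-boolean reply keeps a drifting judge from quietly reintroducing a scalar, and `Verdict.failed` names the property that broke instead of folding it into an aggregate.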

Wire it inline, and keep the answer key out of its sight

A judge that runs offline in a dashboard catches regressions after they ship. A judge wired inline catches them before delivery. The pattern is a gate: generate the output, score it on the binary dimensions, and if it fails a dimension that matters, block it - retry with an adjusted prompt, fall back to a safer response, or route to a human - rather than handing the user something that failed the check. This costs a second model call per request, so reserve the inline gate for paths where shipping a bad output is expensive, and run the cheaper offline version everywhere else. The non-negotiable discipline, exactly as with any eval, is anti-contamination. The judge must score against criteria, not against a leaked answer. If you are evaluating faithfulness, the judge sees the source and the output and rules on whether one supports the other - it does not see a reference answer it can pattern-match to. The moment the expected answer is visible to the judge, in its prompt, in a tool result, or as a few-shot example, you are measuring lookup, not judgment, and the gate will pass things it should have stopped. Lowering the judge's sampling temperature makes its scores far more stable run to run, which is what you need when grading repeatedly against the same dimensions. That is a reproducibility lever, not a correctness fix, and it is not a claim that the judge no longer produces wrong verdicts.
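The gate described above can be sketched as a small wrapper around two injected callables. Everything named here is an assumption for illustration: `gated_generate`, the `BLOCKING` tuple, the fallback text, and the shapes of `generate` and `judge`, which stand in for whatever model clients you actually use.

```python
from typing import Callable, Dict

# Hypothetical choice of which dimensions block delivery; others would only be logged.
BLOCKING = ("supported_by_source", "no_contradiction")


def gated_generate(
    request: str,
    source: str,
    generate: Callable[[str, str, int], str],           # (request, source, attempt) -> output
    judge: Callable[[str, str, str], Dict[str, bool]],  # (request, source, output) -> scores
    max_attempts: int = 2,
    fallback: str = "I can't answer that reliably from the provided material.",
) -> dict:
    """Generate, score, and ship only if every blocking dimension passes."""
    last_failed = []
    for attempt in range(max_attempts):
        # The attempt index lets the generator adjust its prompt on a retry.
        output = generate(request, source, attempt)
        # The judge receives only the request, the source, and the output.
        # No reference answer is passed in, so it cannot pattern-match to one.
        scores = judge(request, source, output)
        # A missing dimension counts as a failure, so the gate fails closed.
        last_failed = [d for d in BLOCKING if not scores.get(d, False)]
        if not last_failed:
            return {"output": output, "shipped": True,
                    "attempts": attempt + 1, "failed": []}
    # Every attempt failed: return the safer response and report what failed.
    return {"output": fallback, "shipped": False,
            "attempts": max_attempts, "failed": last_failed}
```

The `judge` signature has no parameter for an expected answer, which is one way to make the anti-contamination rule structural. Routing to a human instead of returning `fallback` would slot in at the final return.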

Calibrate against humans, then close the loop on the prompt

A judge is an instrument, and an instrument you have not checked against a known reference is just producing numbers you have no reason to trust. Before you trust a gate, take a small set of outputs, have a human score the same binary dimensions, and compare. Where the judge and the human disagree, you learn something specific: the judge is too lenient on one dimension, too harsh on another, or it has quietly invented a criterion you never asked for. Split the human-labeled set up front: tune the judge's rubric and prompt against one part until its calls track human calls closely enough to trust, and reserve the other part, never touched during tuning, purely to confirm the calibration held - the same held-out discipline that keeps any eval honest. Once the judge is calibrated, it becomes the engine of an improvement loop on the thing being judged. The failures it flags are not random; they cluster. Read the blocked outputs, find the recurring pattern - a class of question the prompt mishandles, a context the model keeps ignoring - enrich the generation prompt to address that specific pattern, then A/B the new prompt against the old one on the judge's dimensions before shipping. Diagnose, enrich, validate against the gate, ship. The judge tells you whether the change actually moved the dimension you targeted or just moved the failures somewhere else.
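A minimal sketch of the calibration step, assuming each labeled item is a dict with `human` and `judge` entries that map dimension names to booleans. The function names, the 50/50 split, and the field names are illustrative assumptions; what counts as "close enough" agreement is left to you.

```python
import random


def split_labels(items, holdout_fraction=0.5, seed=0):
    """Shuffle once with a fixed seed, then split into (tuning, held-out) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def agreement_by_dimension(items):
    """Compare judge calls to human calls, one row of counts per dimension."""
    report = {}
    for item in items:
        for dim, human in item["human"].items():
            row = report.setdefault(
                dim, {"n": 0, "agree": 0, "too_lenient": 0, "too_harsh": 0}
            )
            judge = item["judge"][dim]
            row["n"] += 1
            if judge == human:
                row["agree"] += 1
            elif judge and not human:
                row["too_lenient"] += 1  # judge passed what the human failed
            else:
                row["too_harsh"] += 1    # judge failed what the human passed
    for row in report.values():
        row["agreement"] = row["agree"] / row["n"]
    return report
```

In use, you would run `agreement_by_dimension` on the tuning part while adjusting the rubric and judge prompt, then run it once on the held-out part to confirm the agreement held. Keeping `too_lenient` and `too_harsh` separate shows the direction of the disagreement, which matters for a gate: a lenient judge ships bad outputs, while a harsh one only costs retries.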

Where a judge lies to you, and where not to use one

A judge has failure modes of its own, and a gate you trust blindly is worse than no gate. Judges carry bias: many will reward longer, more confident, more elaborately formatted answers regardless of whether they are correct, so a verbose wrong answer can outscore a terse right one - watch for length and style bias explicitly and, where you can, score on properties that do not correlate with verbosity. Judges can be gamed, especially when generator and judge share a model family and you tune the generation prompt against the judge's score - the prompt learns the judge's preferences rather than the underlying quality, so the number climbs while the artifact does not. And judges drift: a model version changes under you, the input distribution shifts, and a rubric that was calibrated six months ago no longer means what its numbers say, which is why recalibration against fresh human labels has to be a standing cadence and not a launch-day task. A judge in this shape is not a retrieval system built on vector embeddings or semantic search; it is a model handed the source and the output and asked a structured scoring question, reasoning over exactly what it was given - and the more it has to fetch on its own, the less the score reflects the artifact in front of it. Finally, know where a judge does not belong. If a cheap deterministic check exists, use it - do not pay a model to confirm what a parser already knows. If the output is high-stakes enough that a wrong pass is unacceptable, the judge gates toward a human, it does not replace one. And if you cannot write the dimension as something a human could adjudicate and a judge could be calibrated against, you do not have a quality gate yet - you have a vibe with a number attached, which is exactly what the gate was supposed to replace.
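One rough probe for the length bias described above is to take the outputs a human failed on a dimension and check whether the judge wrongly passes the longer ones more often. The sketch below assumes each item carries the `output` text plus `human` and `judge` verdicts keyed by dimension; the function name and the median split are illustrative choices, not an established test.

```python
from statistics import median


def false_pass_rate_by_length(items, dimension):
    """Among outputs a human failed, how often does the judge pass short vs. long ones?"""
    human_fails = [it for it in items if not it["human"][dimension]]
    if not human_fails:
        return None  # nothing to measure on this dimension
    # Split at the median character length of the human-failed outputs.
    cut = median(len(it["output"]) for it in human_fails)
    buckets = {"short": [], "long": []}
    for it in human_fails:
        key = "long" if len(it["output"]) > cut else "short"
        buckets[key].append(it["judge"][dimension])
    # Each bucket holds booleans; their mean is the judge's false-pass rate.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A clearly higher rate in the `long` bucket suggests the judge is rewarding verbosity on that dimension. With a small labeled set the gap can be noise, so treat it as a prompt to read those cases, not as a verdict.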

Takeaways

Use a judge model only where correctness is real but not mechanically checkable, and never where a cheap deterministic check already exists. The judge is a consistent measuring instrument, not a smarter reviewer.
Never let a model grade its own output. Use a separate judge call, and score a few sharp binary dimensions - supported by the source, answers the question asked, does not contradict the inputs, valid format - instead of one fuzzy one-to-ten scalar that drifts and hides which property failed.
Wire the judge inline as a gate that blocks sub-threshold outputs before delivery on expensive paths, and enforce anti-contamination: the judge scores against criteria, never against a leaked answer key in its prompt, tools, or few-shot examples. Low temperature buys more stable scores run to run, not correct verdicts.
Split the human labels up front: calibrate the judge against one part before trusting it, reserve the other part untouched to confirm the calibration held, then run a diagnose-enrich-A/B-ship loop on the generation prompt using the judge's flagged failure clusters as the signal.
Watch for the judge's own failures - length and style bias, gaming when generator and judge share a family and you optimize the prompt against its score, and drift as models and inputs change - and recalibrate on a cadence. If a human cannot adjudicate a dimension and the judge cannot be calibrated against that adjudication, it is not a gate, it is a vibe with a number attached.