

How to evaluate LLM output quality at scale

Three eval flavors that actually scale — golden datasets, LLM-as-judge, and online metrics — plus how to know which to use when.

If you've shipped any LLM product, you've hit the same wall: manual quality review doesn't scale. Reading 50 conversations a week might pass as a meaningful sample at 1,000 conversations/day. At 100,000/day, you can't even read the failures, let alone the average cases. You need automation.

LLM evaluation at scale is its own discipline. Here are the three flavors that work in 2026, and the rules for picking among them.

The three flavors

1. Golden dataset evaluation (offline)

A fixed set of inputs, each paired with an ideal answer (or a rubric for what "good" looks like). Run your model against the set, score each output, aggregate.

Used for: regression testing, prompt iteration, model comparison.

Strengths: deterministic, comparable across runs, fast to iterate on.

Weaknesses: only as good as the dataset. Small or biased golden sets give false confidence. Distribution drift in real traffic isn't reflected.
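
A minimal sketch of that run-score-aggregate loop in Python. Both call_model and score_output are hypothetical helpers standing in for your model call and your scoring method (exact match, a string metric, or an LLM judge):

import json
from statistics import mean

def run_golden_eval(golden_path, call_model, score_output):
    # call_model(question) -> answer string            (hypothetical)
    # score_output(answer, reference) -> float in 0..1 (hypothetical)
    scores = []
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)  # one golden entry per JSONL line
            answer = call_model(example["question"])
            reference = example.get("ideal_answer") or example.get("rubric")
            scores.append(score_output(answer, reference))
    return {"mean_score": mean(scores), "n": len(scores)}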

2. LLM-as-judge

Use a strong LLM (Claude Opus, GPT-5) to evaluate other LLM outputs against a rubric.

Prompt:
"You are evaluating an answer for accuracy and helpfulness.
Question: {q}
Answer: {a}
Rubric: {rubric}
Return JSON: { score: 1-5, reasoning: string }"

Used for: scoring outputs at scale, especially for subjective qualities (helpfulness, tone, faithfulness) where exact-match comparison fails.

Strengths: scales to millions of outputs cheaply (Haiku as judge is $0.25/M input). Captures nuanced quality dimensions.

Weaknesses: judge has biases (favors verbosity, formal tone, its own outputs in pairwise comparisons). Calibration drifts when model versions change. Expensive if you use Opus as judge ($90/M output).
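
To run the prompt above at scale, the main practical concern is parsing. A sketch assuming a hypothetical call_judge helper that sends the prompt to your judge model and returns its raw text; the retry loop guards against occasional malformed JSON:

import json

JUDGE_PROMPT = """You are evaluating an answer for accuracy and helpfulness.
Question: {q}
Answer: {a}
Rubric: {rubric}
Return JSON: {{"score": <1-5>, "reasoning": "<string>"}}"""

def judge_one(q, a, rubric, call_judge, retries=2):
    prompt = JUDGE_PROMPT.format(q=q, a=a, rubric=rubric)
    for _ in range(retries + 1):
        raw = call_judge(prompt)  # hypothetical wrapper around the judge model
        try:
            result = json.loads(raw)
            return int(result["score"]), result["reasoning"]
        except (json.JSONDecodeError, KeyError, ValueError):
            continue  # malformed output: re-ask instead of crashing the batch
    return None, None  # count as "unscored" downstream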

3. Online metrics (real users)

Measure user behavior: thumbs up/down, conversation length, retry rate, time-to-resolution, copy-paste rate, abandonment.

Used for: ground truth on real-world quality. The numbers users actually feel.

Strengths: real signal. No artificial dataset. Captures quality dimensions you didn't think to measure.

Weaknesses: noisy. Slow to detect changes. Doesn't tell you which prompt or model change caused regression. Can't catch issues before users see them.

A decision tree

Which do you use? Depends on what you're testing:

  • "Did my prompt change break anything?" → Golden dataset. Run before/after, gate on regressions.
  • "Is GPT-5 better than Sonnet for my task?" → LLM-as-judge on a few thousand real queries.
  • "Is my product getting worse over time?" → Online metrics dashboard.
  • "What kinds of questions do we fail on?" → LLM-as-judge to score real traffic, then human review of low-scored outputs.
  • "Is this hallucination rate acceptable?" → Golden dataset with verifiable answers; complement with online thumbs-down.

The stack that scales is all three: golden for CI, LLM-as-judge for traffic sampling, online metrics for ground truth.

Building the golden dataset (right)

The most common mistake: a 30-question set written by one engineer in an hour. It's biased, narrow, and wrong-distribution.

Good goldens:

  • Sourced from real traffic when possible (sanitize PII). Real users ask things you wouldn't think of.
  • Stratified by category, difficulty, and domain. 30 easy + 30 medium + 30 hard, not 90 random.
  • Includes failure cases the team has seen — bug reports, support tickets, complaints.
  • Has ideal answers OR rubrics. Some questions don't have one right answer (creative writing); a rubric ("is the response engaging? is it factually grounded? is it the right length?") is more flexible.
  • Reviewed by 2+ people before being used. One reviewer's bias becomes the team's bias.
  • Refreshed quarterly with new examples from recent traffic.

Size: 50-200 examples is usually plenty for a single product. Larger isn't always better — a 5,000-example golden takes forever to run after every prompt change.
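
For storage, one shape that supports stratification and either ideal answers or rubrics is one JSON object per line (field names here are illustrative, not a standard):

{"id": "billing-017", "category": "billing", "difficulty": "hard", "question": "...", "ideal_answer": "...", "source": "support ticket"}
{"id": "creative-004", "category": "creative", "difficulty": "medium", "question": "...", "rubric": "engaging, factually grounded, under 200 words", "source": "sampled traffic"}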

LLM-as-judge in practice

The judge prompt is the most important piece. It needs:

  • Clear rubric. Don't say "is this good?" Say "score 1-5 based on (a) accuracy of facts, (b) addresses the user's actual question, (c) appropriate length, (d) safe and respectful tone."
  • Few-shot examples. Show the judge what a 5 looks like and what a 2 looks like. Calibration improves dramatically.
  • Reasoning before score. Ask for the score after the reasoning, not before. "Reason then conclude" gives more reliable scores than "score then justify."
  • Structured output. Use JSON mode or tool use. Strings like "this is a 4 maybe?" are unparseable.
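
Putting those four pieces together, a judge prompt might look like the sketch below (the rubric and few-shot examples are placeholders to replace with your own):

"You are grading an assistant's answer. Score 1-5 against this rubric:
(a) facts are accurate, (b) the user's actual question is addressed,
(c) length is appropriate, (d) tone is safe and respectful.

Example of a 5: <accurate, on-topic, concise answer>; reasoning: meets all four criteria.
Example of a 2: <on-topic answer with a factual error>; reasoning: fails (a) badly.

Question: {q}
Answer: {a}

Work through the rubric first, then conclude.
Return only JSON: { "reasoning": string, "score": 1-5 }"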

A tip that helps a lot: calibrate the judge against humans. Take 100 outputs, have humans score them, have your judge score them. Compare. If correlation is < 0.7, your rubric is unclear or the judge model is weak. Fix the rubric or upgrade the judge.
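
A sketch of that calibration check, assuming the 100 paired scores are already collected into two parallel lists (scipy's spearmanr does the rank correlation):

from scipy.stats import spearmanr

def check_judge_calibration(human_scores, judge_scores, threshold=0.7):
    # parallel lists of 1-5 scores for the same ~100 outputs
    corr, _p_value = spearmanr(human_scores, judge_scores)
    if corr < threshold:
        print(f"correlation {corr:.2f} < {threshold}: fix the rubric or upgrade the judge")
    else:
        print(f"correlation {corr:.2f}: judge is usable for this rubric")
    return corr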

Common LLM-as-judge biases

Known failure modes:

  • Verbosity bias. Longer answers score higher even when shorter is better.
  • Position bias (in pairwise comparison). The first answer shown wins more often. Mitigation: randomize order; or run both orderings and average.
  • Self-preference. GPT-5 judges think GPT-5 outputs are better. Avoid using the same model as judge for outputs from itself.
  • Stylistic bias. Formal tone scores higher in some rubrics regardless of accuracy.
  • Over-confidence on confident-sounding wrong answers. A confidently wrong answer often scores higher than a hedging-but-correct one.

Most of these are mitigated by careful rubric design plus calibration against humans.
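
For pairwise setups, the both-orderings mitigation is only a few lines. The sketch assumes a hypothetical judge_pair(first, second) helper that returns which position won:

def compare_pair(answer_a, answer_b, judge_pair):
    # Run both orderings so position bias cancels out.
    first_run = judge_pair(answer_a, answer_b)   # A shown first
    second_run = judge_pair(answer_b, answer_a)  # B shown first
    a_wins = (first_run == "first") + (second_run == "second")
    if a_wins == 2:
        return "a"
    if a_wins == 0:
        return "b"
    return "tie"  # the verdict flipped with order: treat as no signal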

Online metrics that work

The four signals worth dashboarding:

  1. Thumbs up/down on each response. Add the buttons; ignore the absolute number, watch the rate over time.
  2. Retry rate. What fraction of conversations involve the user re-asking the same question (paraphrased)? Rising = quality dropping.
  3. Conversation length on "task done" outcomes. Did users solve their problem in 2 turns or 8?
  4. Specific feature use (in product): if you ship a "copy answer" button, copy rate is a strong signal.

Dashboard these per cohort (model, prompt version, user segment). When something changes, you have the data to localize the regression.
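
A sketch of that per-cohort rollup with pandas, assuming a feedback table with one row per response (the same shape works as a SQL GROUP BY):

import pandas as pd

# assumed columns: date, model, prompt_version, thumbs ("up", "down", or missing)
df = pd.read_parquet("feedback.parquet")

rated = df[df["thumbs"].notna()]
thumbs_rate = (
    rated.assign(is_up=rated["thumbs"].eq("up"))
         .groupby(["date", "model", "prompt_version"])["is_up"]
         .mean()                     # share of rated responses that were thumbs-up
         .rename("thumbs_up_rate")
         .reset_index()
)
print(thumbs_rate.tail())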

A scaling pattern: 3-tier evaluation

The stack I've seen work for serious products:

  1. Pre-deploy: golden dataset of 100 questions. Run on every prompt change. Block deploy if quality drops > 3% or hallucination rate rises > 1%. CI gate.
  2. Real-time sampling: LLM-as-judge on 5% of production traffic, scored daily. Flag conversations scoring < 3/5 for human review weekly.
  3. Steady state: online metrics dashboard. Per-model thumbs rate. Trends over weeks.

This catches different issues at different latencies: golden catches obvious regressions immediately, sampling catches drift on the real distribution within days, and online metrics catch slow degradation over weeks.
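
The tier-1 gate is just a comparison between the current golden run and a stored baseline. A sketch assuming both runs are saved as small JSON summaries, with "quality drops > 3%" read as 3 points on a 0-1 mean score:

import json, sys

def gate(baseline_path, current_path):
    baseline = json.load(open(baseline_path))  # e.g. {"mean_score": 0.84, "hallucination_rate": 0.02}
    current = json.load(open(current_path))

    score_drop = baseline["mean_score"] - current["mean_score"]
    halluc_rise = current["hallucination_rate"] - baseline["hallucination_rate"]

    if score_drop > 0.03 or halluc_rise > 0.01:
        print(f"BLOCKED: score dropped {score_drop:.1%}, hallucination rate rose {halluc_rise:.1%}")
        sys.exit(1)
    print("golden eval passed")

gate("golden_baseline.json", "golden_current.json")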

Cost

Quick math for a product with 10,000 conversations/day:

  • Golden eval per release (100 questions × Sonnet eval): $0.50/run. Free.
  • 5% LLM-as-judge sampling (500 conversations/day × Haiku judge): $0.20/day = $73/year. Free.
  • Human review of flagged low-scores (1 hour/week of someone's time): the real cost.
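
The sampling figure is easy to sanity-check; the token count below is an assumption to adjust to your transcripts:

# ~500 judged conversations/day, assume ~1,500 input tokens each (transcript + rubric)
# at $0.25/M input for a Haiku-class judge; judge output tokens add a few cents more.
daily_cost = 500 * 1_500 / 1_000_000 * 0.25   # ≈ $0.19/day
print(f"${daily_cost:.2f}/day ≈ ${daily_cost * 365:.0f}/year")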

Quality eval is one of the cheapest investments you'll make in your LLM stack. Most teams skip it because they think it's expensive. It isn't.

When NOT to evaluate

  • Pre-MVP. You don't know what "good" is yet. Get something shippable, then evaluate.
  • No traffic. With < 100 conversations/day you can read them all manually.
  • Single-user internal tool. Just ask the user.

The trigger to invest: when prompt changes start feeling scary because you can't tell if they helped or hurt.

Further reading

  • G-Eval and Prometheus — papers on LLM-as-judge calibration.
  • Ragas, TruLens, DeepEval — practical frameworks.
  • The RAG evaluation post in this Learn library.
  • Look up: pairwise judging, judge calibration, golden dataset stratification, LLM eval observability.

Last updated: 2026-04-29
