Most RAG systems get evaluated like this: the team builds it, the founder asks five questions, the answers look good, ship. Three weeks later customers complain that the answers are wrong, and the team has no way to tell which part of the pipeline broke.
Proper RAG evaluation isn't optional. It's the difference between shipping a system that merely works on average and one that holds up on the long tail of real questions.
This post is the practical playbook. Two evaluation axes, one golden dataset, and the metrics that actually catch bugs.
Why "the answers look good" isn't enough
RAG has at least three distinct failure modes, and they look identical from outside:
- Retrieval failed. The right document wasn't in the top-K. The model answered from its parametric memory, often wrong.
- Retrieval worked, but generation ignored it. The right doc was in context, but the model answered from priors anyway.
- Retrieval worked, generation used it, but the doc itself was wrong / outdated. This is a content problem, not a system problem.
When the founder eyeballs five answers, they can't tell which mode is failing. Each requires a different fix. So you need to evaluate retrieval and generation separately.
Build the golden dataset first
Nothing matters until you have one. A golden dataset for RAG is a list of triples:
{ question, ideal_answer, ideal_source_document_ids }
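In practice this is just a JSON file you keep in version control. A minimal sketch (the field names and the difficulty tag are illustrative, not a standard):

```json
[
  {
    "id": "q-001",
    "question": "What is the refund window for annual plans?",
    "ideal_answer": "Annual plans can be refunded within 30 days of purchase.",
    "ideal_source_document_ids": ["billing-policy-v4"],
    "difficulty": "easy"
  },
  {
    "id": "q-002",
    "question": "Does the product integrate with SAP?",
    "ideal_answer": "I don't know; the documentation doesn't cover SAP.",
    "ideal_source_document_ids": [],
    "difficulty": "out-of-scope"
  }
]
```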
Minimum viable size: 30 questions. Sweet spot: 100-300. Mix:
- Easy questions. Single-doc, exact-phrase matches. (Sanity check.)
- Medium. Multi-doc synthesis, paraphrasing required. (The bulk of real usage.)
- Hard. Multi-hop, contradicting sources, time-bound facts. (Reveals system limits.)
- Out-of-scope. Questions your corpus can't answer. The system should say "I don't know," not hallucinate.
Writing these is annoying. It's also the most leveraged work in the project. A team without a golden set is flying blind.
Where to source: real user questions from logs, customer support tickets, the FAQ that already exists. If you have no users yet, write 30 questions yourself sitting in front of the corpus.
Retrieval evaluation: did the right docs come back?
This is the most underrated part. Run every question through retrieval alone (skip generation) and check whether the ideal source IDs appear in the top-K results.
Three metrics:
- Recall@K. Does the right document appear anywhere in the top K, averaged across questions? If recall@5 = 60%, you're losing 40% of answers before the LLM even sees the question.
- Precision@K. Of K results, how many are relevant? Low precision = LLM is wading through junk.
- MRR (Mean Reciprocal Rank). If the right doc lands at position 1, the score is 1.0; position 2, 0.5; position 5, 0.2. Average the reciprocal ranks across all questions. Captures "how high up was it." The closer to 1.0, the better.
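All three are a few lines of code. A minimal sketch over the golden set, assuming a `retrieve(question, k)` function that returns a ranked list of document IDs; that helper, like the field names, is a placeholder for your own pipeline:

```python
import json

def evaluate_retrieval(golden_path: str, retrieve, k: int = 5) -> dict:
    """Recall@k, precision@k, and MRR over the golden set."""
    with open(golden_path) as f:
        # Out-of-scope questions have no ideal docs; they're scored via refusal rate instead.
        golden = [g for g in json.load(f) if g["ideal_source_document_ids"]]

    hits = prec = rr = 0.0
    for item in golden:
        ideal = set(item["ideal_source_document_ids"])
        ranked = retrieve(item["question"], k)   # ranked list of doc IDs from your pipeline
        relevant = [d for d in ranked if d in ideal]

        hits += bool(relevant)                   # recall@k: did any ideal doc show up at all?
        prec += len(relevant) / k                # precision@k: how much of the top-k was relevant
        # MRR: reciprocal rank of the first ideal doc, 0 if none was returned
        rr += next((1 / (i + 1) for i, d in enumerate(ranked) if d in ideal), 0.0)

    n = len(golden)
    return {f"recall@{k}": hits / n, f"precision@{k}": prec / n, "mrr": rr / n}
```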
In 2026, recall@10 below 80% on a well-curated corpus means your retrieval needs work — likely chunk size, embedding model, or hybrid search.
Generation evaluation: given the docs, was the answer right?
Isolate generation by feeding the ideal source documents directly to the model (not retrieved ones) and grading the output. This tells you whether the LLM can use context that's definitely correct.
Three axes to grade:
- Faithfulness. Does the answer only contain claims supported by the provided docs? Hallucinations here are catastrophic. Use an LLM judge: "Given these docs and this answer, does every factual claim appear in the docs? Answer YES / NO with reasoning."
- Answer relevance. Does the answer address the question? "What's the refund policy?" answered with "We have a 30-day window" — relevant. Answered with "We care about customers" — not relevant.
- Completeness. Does it cover everything the ideal answer covers? Score with partial credit: pull the key claims from the ideal answer and check how many appear in the model's answer.
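The faithfulness judge in particular is a single call. A minimal sketch, with `call_llm(prompt)` standing in for whatever client you use; the helper and the prompt wording are illustrative, not a fixed recipe:

```python
FAITHFULNESS_PROMPT = """You are grading a RAG answer for faithfulness.

Documents:
{docs}

Answer to grade:
{answer}

Does every factual claim in the answer appear in the documents?
Reply with YES or NO on the first line, then one sentence of reasoning."""

def judge_faithfulness(docs: list[str], answer: str, call_llm) -> bool:
    """Return True if the judge finds every claim supported by the docs."""
    prompt = FAITHFULNESS_PROMPT.format(docs="\n\n".join(docs), answer=answer)
    verdict = call_llm(prompt)   # your LLM client of choice
    return verdict.strip().upper().startswith("YES")
```

To isolate generation, pass the golden set's ideal documents as `docs`; to grade the full pipeline, pass whatever retrieval actually returned.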
Once you separate retrieval and generation, you can localize bugs. Retrieval failing? Fix chunking, reranking, or query expansion. Generation failing despite good context? Try a stronger model or improve the system prompt.
End-to-end metrics
Beyond the per-stage metrics, track:
- End-to-end accuracy. Graded by an LLM judge or a human reviewer. The number you actually care about.
- Hallucination rate. Per 100 answers, how many contain a claim that isn't in the source docs. Target < 2%.
- Refusal rate on out-of-scope. When the corpus can't answer, does the system say so? Should be > 95%.
- Latency p50 / p95. Real users notice latency more than they notice 5% accuracy gains.
- Cost per query. Critical for unit economics.
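Rolling per-query results up into these numbers is mechanical. A minimal sketch, assuming each graded query yields a record with `correct`, `hallucinated`, `refused`, `in_scope`, `latency_ms`, and `cost_usd` fields (all placeholder names):

```python
def dashboard(results: list[dict]) -> dict:
    """Aggregate per-query eval records into the end-to-end numbers worth tracking."""
    lat = sorted(r["latency_ms"] for r in results)
    oos = [r for r in results if not r["in_scope"]]   # out-of-scope questions only
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in results) / n,
        "oos_refusal_rate": sum(r["refused"] for r in oos) / len(oos) if oos else None,
        "latency_p50_ms": lat[n // 2],
        "latency_p95_ms": lat[int(n * 0.95)],
        "cost_per_query_usd": sum(r["cost_usd"] for r in results) / n,
    }
```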
Tools that already do this
You don't need to build the framework from scratch. As of 2026:
- Ragas. Open-source, the de facto standard. Implements faithfulness, answer relevance, context precision/recall. Pip install, point at your data, get scores.
- TruLens. Similar scope, more dashboard-y.
- Promptfoo. Generic LLM eval tool but has solid RAG support.
- DeepEval. Pytest-style assertions on LLM output. Good for CI.
- Langfuse / LangSmith. Tracing + eval combined; useful if you're already using them for observability.
For a real product: use Ragas or TruLens for batch evaluation, plus a 30-question regression set you run on every prompt or model change. Catch breakage before users do.
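For orientation, a Ragas batch run looks roughly like this. Treat it as a sketch of the 0.1-era API rather than a copy-paste recipe; the interface and column names have shifted between releases, so check the current docs:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One row per golden question: the question, the generated answer,
# the retrieved chunks, and the ideal answer as ground truth.
data = Dataset.from_dict({
    "question":     ["What is the refund window for annual plans?"],
    "answer":       ["Annual plans can be refunded within 30 days."],
    "contexts":     [["Refunds: annual plans may be refunded within 30 days of purchase."]],
    "ground_truth": ["Annual plans can be refunded within 30 days of purchase."],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)   # summary of per-metric scores
```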
When NOT to invest in evaluation
- Pre-MVP. If you're still figuring out whether RAG is the right approach at all, eval is overkill. Get the first ugly version working, then evaluate.
- Internal-only tools used by 3 people. Just ask the users.
- The corpus changes every day and you have no time to update goldens. Then track aggregated metrics on real traffic instead — refusal rate, sources cited, user thumbs up/down — and skip the static eval set.
The CI pattern that actually catches regressions
Here's the workflow that works in practice:
- Maintain a 30-question regression set in version control as a JSON file.
- After every meaningful change (new chunking, new model, new prompt), run the regression set.
- Gate deploys on it: if accuracy drops > 5% or hallucination rate rises > 1%, block and investigate.
- Once a quarter, expand the set with 10 new questions sourced from real user complaints.
This catches 80% of regressions for 20% of the effort of "proper" evaluation infrastructure.
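The gate itself is a short script. A minimal sketch, assuming the last known-good scores live next to the regression set; the file paths, thresholds, and field names are all placeholders:

```python
import json
import sys

def gate(baseline_path: str = "eval/baseline.json",
         current_path: str = "eval/current.json") -> int:
    """Exit non-zero if the new run regresses past the agreed thresholds."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        curr = json.load(f)

    failures = []
    if curr["accuracy"] < base["accuracy"] - 0.05:   # accuracy drop of more than 5 points
        failures.append(f"accuracy {base['accuracy']:.2f} -> {curr['accuracy']:.2f}")
    if curr["hallucination_rate"] > base["hallucination_rate"] + 0.01:   # hallucinations up more than 1 point
        failures.append(f"hallucination rate {base['hallucination_rate']:.2%} -> {curr['hallucination_rate']:.2%}")

    if failures:
        print("DEPLOY BLOCKED: " + "; ".join(failures))
        return 1
    print("regression set passed")
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```

Run it as the last step of CI after the regression set finishes, and update the baseline file whenever a change is accepted.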
Further reading
- Ragas paper and docs.
- RAG vs Fine-tuning — Microsoft research piece.
- Look up: chunking strategies, hybrid search, reranking, LLM-as-judge calibration.