chapter fourteen

14 Evaluating RAG systems

This chapter covers

Classical retrieval metrics and why they mislead in RAG
The RAGAS evaluation triad: faithfulness, answer relevance, context relevance
LLM-as-judge techniques from G-Eval through ARES
Generating synthetic evaluation datasets from your corpus
Building evaluation pipelines that diagnose failure points

By 2024, the RAG community had developed dozens of techniques for improving retrieval and generation: query expansion, reranking, context compression, reflection tokens, and corrective mechanisms. What it lacked was a reliable way to measure whether any of them worked. Teams would swap embedding models, tune chunk sizes, and rewrite system prompts, then judge the results by eyeballing a handful of cherry-picked queries. Too often, the result was a system that looked good in demos but fell apart in production.

14.1 Classical retrieval metrics and their limits

14.1.1 The foundation: precision and recall

14.1.2 The correlation gap: When good retrieval metrics produce bad answers

14.2 The RAGAS evaluation framework

14.2.1 Faithfulness: Does the answer stick to the evidence?

14.2.2 Answer relevance: Does the answer address the question?

14.2.3 Context relevance: Did retrieval fetch what was needed?

14.2.4 Putting the triad to work

14.2.5 Limitations and caveats

14.2.6 Beyond the triad: When ground-truth still earns its keep

14.3 LLM-as-judge: From G-Eval to trained evaluators

14.3.1 G-Eval: Chain-of-thought meets evaluation

14.3.2 ARES: Trained judges with statistical guarantees

14.3.3 The bias problem: What LLM judges get wrong

14.4 Building your evaluation dataset

14.4.1 Public benchmarks: Useful but skewed

14.4.2 Generating balanced evaluation data from your corpus

14.4.3 Practical tools for synthetic evaluation

14.5 Implementing an evaluation pipeline

14.5.1 A faithfulness judge

14.5.2 Mapping scores to failure points

14.5.3 Practical considerations at scale

14.6 Case study: Continuous evaluation for a compliance assistant

14.6.1 Establishing a baseline

14.6.2 The reranking fix and its side effect

14.6.3 Targeted improvement and monitoring

14.6.4 The evaluation flywheel

14.7 Summary