chapter fourteen

14 Evaluating RAG systems

This chapter covers

  • Classical retrieval metrics and why they mislead in RAG
  • The RAGAS evaluation triad: faithfulness, answer relevance, context relevance
  • LLM-as-judge techniques from G-Eval through ARES
  • Generating synthetic evaluation datasets from your corpus
  • Building evaluation pipelines that diagnose failure points

By 2024, the RAG community had developed dozens of techniques for improving retrieval and generation: query expansion, reranking, context compression, reflection tokens, and corrective mechanisms. What it lacked was a reliable way to measure whether any of them worked. Teams would swap embedding models, tune chunk sizes, and rewrite system prompts, then evaluate the results by eyeballing a handful of cherry-picked queries. Too often, the result was a system that looked good in demos but fell apart embarrassingly in production.

14.1 Classical retrieval metrics and their limits

14.1.1 The foundation: precision and recall

14.1.2 The correlation gap: When good retrieval metrics produce bad answers

14.2 The RAGAS evaluation framework

14.2.1 Faithfulness: Does the answer stick to the evidence?

14.2.2 Answer relevance: Does the answer address the question?

14.2.3 Context relevance: Did retrieval fetch what was needed?

14.2.4 Putting the triad to work

14.2.5 Limitations and caveats

14.2.6 Beyond the triad: When ground-truth still earns its keep

14.3 LLM-as-judge: From G-Eval to trained evaluators

14.3.1 G-Eval: Chain-of-thought meets evaluation

14.3.2 ARES: Trained judges with statistical guarantees

14.3.3 The bias problem: What LLM judges get wrong

14.4 Building your evaluation dataset

14.4.1 Public benchmarks: Useful but skewed

14.4.2 Generating balanced evaluation data from your corpus

14.4.3 Practical tools for synthetic evaluation

14.5 Implementing an evaluation pipeline

14.5.1 A faithfulness judge

14.5.2 Mapping scores to failure points

14.5.3 Practical considerations at scale

14.6 Case study: Continuous evaluation for a compliance assistant

14.6.1 Establishing a baseline

14.6.2 The reranking fix and its side effect

14.6.3 Targeted improvement and monitoring

14.6.4 The evaluation flywheel

14.7 Summary