14 Evaluating RAG systems
This chapter covers
- Classical retrieval metrics and why they mislead in RAG
- The RAGAS evaluation triad: faithfulness, answer relevance, and context relevance
- LLM-as-judge techniques from G-Eval through ARES
- Generating synthetic evaluation datasets from your corpus
- Building evaluation pipelines that diagnose failure points
By 2024, the RAG community had developed dozens of techniques for improving retrieval and generation: query expansion, reranking, context compression, reflection tokens, and corrective mechanisms. What it lacked was a reliable way to measure whether any of them worked. Teams would swap embedding models, tune chunk sizes, and rewrite system prompts, then evaluate the results by eyeballing a handful of cherry-picked queries. The result was a system that looked good in demos but fell apart, embarrassingly, in production.
14.1 Classical retrieval metrics and their limits
14.1.1 The foundation: precision and recall
14.1.2 The correlation gap: When good retrieval metrics produce bad answers
14.2 The RAGAS evaluation framework
14.2.1 Faithfulness: Does the answer stick to the evidence?
14.2.2 Answer relevance: Does the answer address the question?
14.2.3 Context relevance: Did retrieval fetch what was needed?
14.2.4 Putting the triad to work
14.2.5 Limitations and caveats
14.2.6 Beyond the triad: When ground truth still earns its keep
14.3 LLM-as-judge: From G-Eval to trained evaluators
14.3.1 G-Eval: Chain-of-thought meets evaluation
14.3.2 ARES: Trained judges with statistical guarantees
14.3.3 The bias problem: What LLM judges get wrong
14.4 Building your evaluation dataset
14.4.1 Public benchmarks: Useful but skewed
14.4.2 Generating balanced evaluation data from your corpus
14.4.3 Practical tools for synthetic evaluation
14.5 Implementing an evaluation pipeline
14.5.1 A faithfulness judge
14.5.2 Mapping scores to failure points
14.5.3 Practical considerations at scale
14.6 Case study: Continuous evaluation for a compliance assistant
14.6.1 Establishing a baseline
14.6.2 The reranking fix and its side effect
14.6.3 Targeted improvement and monitoring
14.6.4 The evaluation flywheel
14.7 Summary