chapter five

5 RAG Evaluation: Accuracy, Relevance, Faithfulness

This chapter covers

The need and requirements for evaluating RAG pipelines
Metrics, frameworks, and benchmarks for RAG evaluation
Current limitations and the future course of RAG evaluation

In Chapters 3 and 4, we discussed how RAG systems are built through the indexing and generation pipelines. The promise of RAG is to reduce hallucinations and ground LLM responses in the provided context. It does this by creating a non-parametric memory, or knowledge base, for the system and then retrieving information from that knowledge base.

In this chapter, we cover methods for evaluating how well a RAG system is functioning. We need to make sure that the components of the two RAG pipelines perform as expected. At a high level, this means checking that the retrieved information is relevant to the input query and that the LLM generates responses grounded in the retrieved context. Several frameworks have been developed for this purpose. We will discuss some popular frameworks and the metrics they calculate.

5.1 Key Aspects of RAG evaluation

5.1.1 Quality Scores

5.1.2 Required Abilities

5.2 Evaluation Metrics

5.2.1 Retrieval Metrics

5.2.2 RAG-specific metrics

5.3 Frameworks

5.3.1 RAGAs

5.3.2 ARES

5.4 Benchmarks

5.5 Limitations and Best Practices

5.6 Summary