5 RAG Evaluation: Accuracy, Relevance, Faithfulness
This chapter covers
- The need and requirements for evaluating RAG pipelines
- Metrics, frameworks, and benchmarks for RAG evaluation
- Current limitations and future directions of RAG evaluation
In Chapters 3 and 4, we discussed how RAG systems are built through the indexing and generation pipelines. The promise of RAG is to reduce hallucinations and ground LLM responses in the provided context. It does so by creating a non-parametric memory, or knowledge base, for the system and then retrieving information from that knowledge base.
In this chapter, we cover methods for evaluating how well a RAG system is functioning. We need to make sure that the components of the two RAG pipelines perform as expected. At a high level, this means checking that the retrieved information is relevant to the input query and that the LLM generates responses grounded in the retrieved context. Several frameworks have been developed for this purpose. We will discuss some popular frameworks and the metrics they calculate.
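To make the two high-level checks concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular framework. The function names (`retrieval_precision`, `groundedness_proxy`) are hypothetical, relevance is assumed to come from human labels, and groundedness is approximated by simple word overlap, which is far cruder than the LLM-based scoring that evaluation frameworks typically use.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks labeled as relevant to the query.

    Assumption: `relevant_ids` comes from human (or otherwise trusted) labels.
    """
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def groundedness_proxy(response: str, context_chunks: list[str]) -> float:
    """Fraction of distinct response tokens that also appear in the context.

    Assumption: lexical overlap is only a rough stand-in for groundedness.
    It cannot tell a faithful paraphrase from an unsupported claim.
    """
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for chunk in context_chunks:
        context_tokens |= tokenize(chunk)
    return len(response_tokens & context_tokens) / len(response_tokens)


if __name__ == "__main__":
    # Two of the four retrieved chunks are labeled relevant.
    print(retrieval_precision(["c1", "c2", "c3", "c4"], {"c1", "c3"}))

    context = ["The capital of France is Paris."]
    # Every word of this response occurs in the retrieved context.
    print(groundedness_proxy("Paris is the capital of France", context))
    # "Berlin" is not supported by the context, so the score drops.
    print(groundedness_proxy("Berlin is the capital", context))
```

The first score reflects the quality of the retrieval step and the second reflects the generation step, which mirrors the split between the two questions posed above. A low value on either side points to a different pipeline component to investigate.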