This chapter covers
- The need and requirements for evaluating RAG pipelines
- Metrics, frameworks, and benchmarks for RAG evaluation
- Current limitations and future directions of RAG evaluation
Chapters 3 and 4 discussed building retrieval-augmented generation (RAG) systems using the indexing and generation pipelines. RAG promises to reduce hallucinations and ground large language model (LLM) responses in the provided context. It does this by creating a non-parametric memory, or knowledge base, for the system and retrieving relevant information from it at query time.
This chapter covers the methods used to evaluate how well a RAG system is functioning. We need to make sure that the components of the two RAG pipelines are performing as expected. At a high level, we need to ensure that the retrieved information is relevant to the input query and that the LLM generates responses grounded in the retrieved context. Several frameworks have been developed for this purpose; here we discuss some popular ones and the metrics they calculate.
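To make these two checks concrete, the following sketch scores a single query-answer pair on context relevance and groundedness. It is purely illustrative: the function names (`context_relevance`, `groundedness`) and the crude word-overlap heuristic are assumptions made for this example, standing in for the LLM-based judges that the frameworks discussed later in this chapter typically use.

```python
# Illustrative sketch of the two high-level RAG checks:
# (1) is the retrieved context relevant to the query?
# (2) is the generated answer grounded in that context?
# A simple word-overlap heuristic stands in for an LLM judge so the
# example runs on its own; real frameworks compute these scores differently.

def _word_overlap(text: str, reference: str) -> float:
    """Fraction of words in `text` that also appear in `reference` (crude proxy)."""
    words = {w.lower().strip(".,?!") for w in text.split()}
    ref_words = {w.lower().strip(".,?!") for w in reference.split()}
    return len(words & ref_words) / len(words) if words else 0.0

def context_relevance(query: str, retrieved_chunks: list[str]) -> float:
    """Average overlap between the query and each retrieved chunk."""
    if not retrieved_chunks:
        return 0.0
    return sum(_word_overlap(query, c) for c in retrieved_chunks) / len(retrieved_chunks)

def groundedness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of answer sentences with high overlap against the retrieved context."""
    context = " ".join(retrieved_chunks)
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(_word_overlap(s, context) > 0.5 for s in sentences)
    return supported / len(sentences)

if __name__ == "__main__":
    query = "When was the Eiffel Tower built?"
    chunks = ["The Eiffel Tower was built between 1887 and 1889 in Paris."]
    answer = "The Eiffel Tower was built between 1887 and 1889."
    print("context relevance:", context_relevance(query, chunks))
    print("groundedness:", groundedness(answer, chunks))
```

The structure mirrors what the evaluation frameworks in this chapter automate: one score per retrieval-quality question and one per generation-quality question, each computed for every query-answer pair in a test set.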