This chapter covers
- The need and requirements for evaluating RAG pipelines
- Metrics, frameworks, and benchmarks for RAG evaluation
- Current limitations and future directions of RAG evaluation
Chapters 3 and 4 discussed building retrieval-augmented generation (RAG) systems using the indexing and generation pipelines. RAG promises to reduce hallucinations and ground large language model (LLM) responses in the provided context. It does this by creating a non-parametric memory, or knowledge base, for the system and retrieving relevant information from it at query time.
This chapter covers the methods used to evaluate how well a RAG system is functioning. We need to make sure that the components of the two RAG pipelines are performing as expected. At a high level, we need to ensure that the retrieved information is relevant to the input query and that the LLM generates responses grounded in the retrieved context. Several frameworks have been developed for this purpose; here we discuss some popular ones and the metrics they calculate.
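To make these two checks concrete, the following sketch scores a single query-answer pair on context relevance and groundedness. It is purely illustrative: the function names (`context_relevance`, `groundedness`) and the crude word-overlap heuristic are assumptions made for this example, standing in for the LLM-based judges that the frameworks discussed later in this chapter typically use.

```python
# Illustrative sketch of the two high-level RAG checks:
# (1) is the retrieved context relevant to the query?
# (2) is the generated answer grounded in that context?
# A simple word-overlap heuristic stands in for an LLM judge so the
# example runs on its own; real frameworks compute these scores differently.

def _word_overlap(text: str, reference: str) -> float:
    """Fraction of words in `text` that also appear in `reference` (crude proxy)."""
    words = {w.lower().strip(".,?!") for w in text.split()}
    ref_words = {w.lower().strip(".,?!") for w in reference.split()}
    return len(words & ref_words) / len(words) if words else 0.0

def context_relevance(query: str, retrieved_chunks: list[str]) -> float:
    """Average overlap between the query and each retrieved chunk."""
    if not retrieved_chunks:
        return 0.0
    return sum(_word_overlap(query, c) for c in retrieved_chunks) / len(retrieved_chunks)

def groundedness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of answer sentences with high overlap against the retrieved context."""
    context = " ".join(retrieved_chunks)
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(_word_overlap(s, context) > 0.5 for s in sentences)
    return supported / len(sentences)

if __name__ == "__main__":
    query = "When was the Eiffel Tower built?"
    chunks = ["The Eiffel Tower was built between 1887 and 1889 in Paris."]
    answer = "The Eiffel Tower was built between 1887 and 1889."
    print("context relevance:", context_relevance(query, chunks))
    print("groundedness:", groundedness(answer, chunks))
```

The structure mirrors what the evaluation frameworks in this chapter automate: one score per retrieval-quality question and one per generation-quality question, each computed for every query-answer pair in a test set.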