8 RAG application evaluation
This chapter covers
- Benchmarking RAG applications and agent capabilities
- Designing evaluation datasets
- Applying RAGAS metrics: recall, faithfulness, correctness
In this chapter, you will explore the importance of evaluating your RAG application's performance using carefully constructed benchmark questions. As your RAG pipeline grows more sophisticated and complex, it becomes essential to ensure that your agent’s answers remain both accurate and coherent across a wide range of queries. A benchmark evaluation provides a systematic way to measure the agent’s capabilities while also helping you clearly define and scope what the agent is expected to handle.
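To make the idea of a benchmark concrete, the sketch below shows one possible shape for a couple of benchmark entries. The field names (question, ground_truth, expected_tool) and the sample questions are illustrative placeholders rather than a fixed schema; you will refine what goes into each entry as the chapter progresses.

```python
# Illustrative benchmark entries; field names and question text are placeholders.
benchmark_questions = [
    {
        "question": "Which products were mentioned most often in last quarter's reviews?",
        "ground_truth": "The most frequently mentioned products were ...",
        "expected_tool": "cypher_template",
    },
    {
        "question": "Summarize what customers say about delivery times.",
        "ground_truth": "Customers generally report that deliveries ...",
        "expected_tool": "vector_search",
    },
]
```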
Evaluating RAG applications involves multiple approaches, each addressing a different step of the application, as shown in figure 8.1, which gives a high-level overview of a pipeline for a question-answering system powered by an LLM with retrieval capabilities. The pipeline begins with the user posing a question to the system. The LLM then identifies the most suitable retrieval tool to fetch the necessary information. This step is critical, and you can evaluate it by measuring how accurately the agent selects the right tool.
Figure 8.1 Evaluating different steps of a RAG pipeline
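As a rough illustration of that first check, tool-selection accuracy can be computed by comparing the tool the agent picks for each benchmark question against the tool you expected it to pick. The select_tool method below is a hypothetical stand-in for however your agent exposes its routing decision:

```python
def tool_selection_accuracy(agent, benchmark_questions):
    """Fraction of benchmark questions routed to the expected retrieval tool."""
    correct = 0
    for item in benchmark_questions:
        # select_tool is a hypothetical method representing the agent's routing step
        selected_tool = agent.select_tool(item["question"])
        if selected_tool == item["expected_tool"]:
            correct += 1
    return correct / len(benchmark_questions)
```

A score close to 1.0 means the agent routes nearly every benchmark question to the intended retriever; lower scores usually point to ambiguous tool descriptions or overlapping tool responsibilities.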

Throughout this book, you have implemented various retrieval tool designs, starting with vector search and progressing to more structured approaches like text2cypher and Cypher templates. Each retrieval method serves different needs: