9 Evaluation and Performance for LLMs and Agents


This chapter covers

  • Identifying and measuring hallucinations with approaches like FActScore, ROUGE, and LLM-as-judge evaluation
  • Implementing red teaming and stress testing strategies for robust AI systems
  • Building production-ready monitoring with frameworks like Arize AI and Phoenix
  • Applying core architectural patterns: token streaming, batching, semantic caching, and multi-model fallback
  • Evaluating agents using trajectory analysis and end-to-end testing

Throughout this book, we have built increasingly sophisticated LLM applications. We started with basic prompt engineering, moved to structured outputs and function calling, added retrieval-augmented generation for grounding responses in real data, and in the previous chapter constructed multi-agent systems where specialized agents collaborate on complex tasks. Each layer added capability. Each layer also added ways for things to go wrong.

A prompt can be poorly written. A structured output can fail to parse. A retrieval system can fetch irrelevant documents. An agent can call the wrong tool or loop indefinitely. But the most insidious failure cuts across all of these: hallucination, where the model states something false with complete confidence. Unlike crashes or error messages, hallucinations do not announce themselves. They slip past users and damage trust before anyone notices.
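
To make the problem concrete before we dig into metrics, consider the simplest possible detector: asking a second model to act as a judge, an approach we develop fully in section 9.1.4. The following sketch is illustrative only; the client setup, model name, and prompt wording are assumptions using the OpenAI Python SDK, not the chapter's final implementation.

from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment. The model name and
# prompt below are illustrative placeholders, not recommendations.
client = OpenAI()

def judge_groundedness(context: str, answer: str) -> str:
    """Ask a judge model whether every claim in `answer` is backed by
    `context`. Returns the judge's verdict as raw text."""
    prompt = (
        "You are a strict fact-checker. Reply SUPPORTED if every claim "
        "in the answer is backed by the context, or HALLUCINATED "
        "otherwise, followed by a one-sentence reason.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                # keep the verdict deterministic
    )
    return response.choices[0].message.content

verdict = judge_groundedness(
    context="Our store ships to the US and Canada only.",
    answer="We offer free worldwide shipping.",
)
print(verdict)  # expect HALLUCINATED plus a brief reason

Even this naive check changes the failure mode described above: instead of waiting for a user to notice a false claim, the system can flag it at generation time.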

9.1 Identifying and measuring hallucinations

9.1.1 Four steps to identify and measure hallucinations

9.1.2 FActScore: Fine-grained factual evaluation

9.1.3 ROUGE metric for summarization evaluation

9.1.4 LLM as a judge: A holistic approach

9.1.5 Red teaming and stress testing

9.1.6 Using monitoring frameworks for hallucination detection

9.2 Essential architectural patterns for performance

9.2.1 Token streaming: Presenting answers incrementally or all at once

9.2.2 Handling surges with batching: System-level vs. OpenAI's Batch API

9.2.3 Caching for efficiency

9.2.4 Multi-model fallback: Matching each query to the right model

9.2.5 Project: Building an e-commerce LLM service with batching, caching, and model fallback

9.2.6 Further improvements

9.3 Evaluating agent performance

9.3.1 Core metrics for evaluating LLMs

9.3.2 Evaluating agents: Beyond traditional LLM metrics

9.4 Summary

9.5 References