9 Evaluation and Performance for LLMs and Agents
This chapter covers
- Detecting and measuring hallucinations with metrics such as FActScore and ROUGE, and with LLM-as-judge approaches
- Implementing red teaming and stress testing strategies for robust AI systems
- Building production-ready monitoring with observability tools like Arize AI and its open source Phoenix library
- Implementing core architectural patterns: token streaming, batching, semantic caching, and multi-model fallback
- Evaluating agents using trajectory analysis and end-to-end testing
Throughout this book, we have built increasingly sophisticated LLM applications. We started with basic prompt engineering, moved to structured outputs and function calling, added retrieval-augmented generation for grounding responses in real data, and in the previous chapter constructed multi-agent systems where specialized agents collaborate on complex tasks. Each layer added capability. Each layer also added ways for things to go wrong.
A prompt can be poorly written. A structured output can fail to parse. A retrieval system can fetch irrelevant documents. An agent can call the wrong tool or loop indefinitely. But the most insidious failure cuts across all of these: hallucination, output that reads as confident and plausible yet is not grounded in fact. Unlike crashes or error messages, hallucinations do not announce themselves. They slip past users and damage trust before anyone notices.
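To make the detection problem concrete before we dig into metrics, here is a minimal LLM-as-judge sketch of the kind this chapter builds toward: a judge model is asked whether an answer is actually supported by its source context. It assumes the OpenAI Python SDK with an `OPENAI_API_KEY` environment variable set; the model name, prompt wording, and `judge_answer` helper are illustrative assumptions, not a prescribed setup.

```python
# Minimal LLM-as-judge sketch: ask a judge model whether an answer is
# supported by its source context. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict fact-checking judge.

Context:
{context}

Answer to check:
{answer}

Reply with exactly one word: SUPPORTED if every claim in the answer is
backed by the context, or HALLUCINATED otherwise."""


def judge_answer(context: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge model flags the answer as hallucinated."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
        temperature=0,  # deterministic verdicts make the judge easier to trust
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("HALLUCINATED")


if __name__ == "__main__":
    context = "Our refund policy allows returns within 30 days of purchase."
    answer = "You can return items within 90 days for a full refund."
    print("Hallucinated?", judge_answer(context, answer))
```

A check like this only surfaces hallucinations when you run it deliberately, which is exactly why the rest of this chapter treats evaluation and monitoring as first-class parts of the system rather than an afterthought.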