7 Building robust agents with evaluation and feedback

This chapter covers

  • Introducing agent evaluation and feedback
  • Implementing test-driven agent development
  • Employing grounding, critic, and evaluation agents
  • Using Phoenix for evaluation and feedback

Building robust, reliable, safe, and debuggable agentic systems is all about implementing evaluation and feedback. Agent evaluation comes in many forms: benchmark testing, red-team testing, grounding, and even agents that evaluate other agents. Likewise, feedback for agents may come from human experience, evaluator or critic agents, test output, and self-assessment.

While implementing evaluation and feedback is generally a requirement for any production agent system, production shouldn't be the only time you look at hardening your agents. You almost always want to roll in this final layer (Layer 5, Evaluation and Feedback), which we explore in this chapter.
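To preview the core idea behind this layer, the loop below is a minimal, self-contained sketch of evaluation and feedback around an agent call. All names (`run_agent`, `evaluate`, `answer_with_feedback`, `MAX_RETRIES`) are illustrative stand-ins, not an API from this chapter: a real system would call an LLM and an evaluator model rather than these stub functions.

```python
# Hypothetical sketch: an agent call wrapped in an evaluate-and-retry loop.
# run_agent and evaluate are stubs standing in for LLM and evaluator calls.

MAX_RETRIES = 2  # illustrative retry budget

def run_agent(prompt: str, feedback: str = "") -> str:
    """Stand-in for an LLM agent; applies feedback to simulate a revision."""
    answer = f"answer to: {prompt}"
    if feedback:
        answer += " (revised: grounded in source)"
    return answer

def evaluate(answer: str) -> tuple[bool, str]:
    """Stand-in evaluator/critic: passes only answers marked as grounded."""
    if "grounded" in answer:
        return True, ""
    return False, "Cite the source material to ground your answer."

def answer_with_feedback(prompt: str) -> str:
    """Run the agent, evaluate its output, and feed critiques back until it passes."""
    answer = run_agent(prompt)
    for _ in range(MAX_RETRIES):
        ok, feedback = evaluate(answer)
        if ok:
            break
        answer = run_agent(prompt, feedback)
    return answer

print(answer_with_feedback("What is agent grounding?"))
```

The pattern is the same whether the evaluator is a unit test, a grounding check, or a rubric critic agent: evaluate the output, turn the evaluation into feedback, and let the agent revise.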

7.1 Introducing agent evaluation and feedback

7.2 Implementing test-driven agent development

7.2.1 Exploring TDAD in practice

7.2.2 Coding and testing the RAG agent

7.2.3 Refactoring the agent

7.2.4 Extending evaluation with an agent evaluator

7.3 Employing grounding, critic, and evaluation agents

7.3.1 Reviewing the grounding agent

7.3.2 Grounding the RAG agent

7.3.3 Implementing grounding agents as guardrails

7.3.4 Understanding the role of rubrics in evaluation

7.3.5 Building a rubric critic agent

7.4 Using Phoenix for evaluation and feedback

7.4.1 Connecting to Phoenix

7.4.2 Adding metadata and session tracking

7.4.3 Experimenting with evaluators

7.4.4 Providing feedback with Annotations

7.5 Exercises

7.6 Summary