7 Building robust agents with evaluation and feedback

This chapter covers

  • Introducing agent evaluation and feedback
  • Implementing test-driven agent development
  • Employing grounding, critic, and evaluation agents
  • Using Phoenix for evaluation and feedback

Building robust, reliable, safe, and debuggable agentic systems is all about implementing evaluation and feedback. Agent evaluation comes in many forms: benchmark testing, red-team testing, grounding, and even agents that evaluate other agents. Likewise, feedback for agents may come from human experience, evaluator or critic agents, test output, and self-assessment.

While implementing evaluation and feedback is generally a requirement for any production agent system, production shouldn't be the only time you look at hardening your agents. You almost always want to roll in this final layer (Layer 5, Evaluation and Feedback), which we explore in this chapter.
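To preview the core idea behind this layer, the loop below is a minimal, self-contained sketch of evaluation and feedback around an agent call. All names (`run_agent`, `evaluate`, `answer_with_feedback`, `MAX_RETRIES`) are illustrative stand-ins, not an API from this chapter: a real system would call an LLM and an evaluator model rather than these stub functions.

```python
# Hypothetical sketch: an agent call wrapped in an evaluate-and-retry loop.
# run_agent and evaluate are stubs standing in for LLM and evaluator calls.

MAX_RETRIES = 2  # illustrative retry budget

def run_agent(prompt: str, feedback: str = "") -> str:
    """Stand-in for an LLM agent; applies feedback to simulate a revision."""
    answer = f"answer to: {prompt}"
    if feedback:
        answer += " (revised: grounded in source)"
    return answer

def evaluate(answer: str) -> tuple[bool, str]:
    """Stand-in evaluator/critic: passes only answers marked as grounded."""
    if "grounded" in answer:
        return True, ""
    return False, "Cite the source material to ground your answer."

def answer_with_feedback(prompt: str) -> str:
    """Run the agent, evaluate its output, and feed critiques back until it passes."""
    answer = run_agent(prompt)
    for _ in range(MAX_RETRIES):
        ok, feedback = evaluate(answer)
        if ok:
            break
        answer = run_agent(prompt, feedback)
    return answer

print(answer_with_feedback("What is agent grounding?"))
```

The pattern is the same whether the evaluator is a unit test, a grounding check, or a rubric critic agent: evaluate the output, turn the evaluation into feedback, and let the agent revise.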

7.1 Introducing agent evaluation and feedback

7.2 Implementing test-driven agent development

7.2.1 Exploring TDAD in practice

7.2.2 Coding and testing the RAG agent

7.2.3 Refactoring the agent

7.2.4 Extending evaluation with an agent evaluator

7.3 Employing grounding, critic, and evaluation agents

7.3.1 Reviewing the grounding agent

7.3.2 Grounding the RAG agent

7.3.3 Implementing grounding agents as guardrails

7.3.4 Understanding the role of rubrics in evaluation

7.3.5 Building a rubric critic agent

7.4 Using Phoenix for evaluation and feedback

7.4.1 Connecting to Phoenix

7.4.2 Adding metadata and session tracking

7.4.3 Experimenting with evaluators

7.4.4 Providing feedback with Annotations

7.5 Exercises

7.6 Summary