
10 Evaluating agents


This chapter covers

  • Why agent evaluation must be automated
  • Concrete procedures for automating evaluation
  • How to put automated evaluation to work in practice

An agent goes beyond merely suggesting an answer. It decides for itself what information to seek, which tools to use, in what sequence to execute tasks, and when to stop. This autonomy is a powerful advantage, but a single small mistake can cascade into real-world costs and risks: erroneous payments, unwanted procurement, permission changes, and unintended external communications. That is why rigorous evaluation is essential: to prevent agents from causing unexpected harm and to protect users from potentially significant losses.

10.1 Observing an agent

10.1.1 Pillars of observability: Metrics, traces, and logs

10.1.2 Generating, collecting, and exporting telemetry: OpenTelemetry

10.2 Building datasets and establishing evaluation criteria

10.2.1 What should we evaluate?

10.2.2 Creating a dataset

10.2.3 Analyzing errors

10.2.4 Designing rubrics and metrics

10.3 Evaluating with LLM-as-a-Judge

10.3.1 Types of LLM-as-a-Judge

10.3.2 Building a rubric-based evaluation system

10.4 Operations: CI/CD and continuous improvement

10.4.1 Evaluation-gated deployment

10.4.2 Improving the test set and evaluator: Agent quality flywheel

10.5 Summary