chapter ten
10 Evaluating agents
This chapter covers
- Why agent evaluation must be automated
- Concrete procedures for automating evaluation
- How to put automated evaluation to work in practice
An agent does more than suggest an answer. It decides for itself what information to seek, which tools to use, in what order to execute tasks, and when to stop. This autonomy is a powerful advantage, but a single small mistake can cascade into real-world costs and risks in areas such as payments, procurement, permission changes, and external communications. That is why rigorous evaluation is essential: it keeps agents from causing unexpected harm and protects users from potentially significant losses.