chapter seven
7 Reflection and evaluation: How your agent audits and learns from itself
This chapter covers
- Diagnosing why intrinsic self-correction fails and what external signals make it work
- Refining current outputs through tool-grounded Generator-Critic loops
- Extracting reusable capabilities with Skill Packages and the SKILL.md ecosystem
- Learning across sessions through Experience Replay at three abstraction levels
- Recovering from deterministic failures through Self-Heal Loops that close the test-driven feedback cycle
- Evaluating agent systems with the five-layer evaluation stack that turns scattered metrics into a methodology
"The unexamined life is not worth living."
— Socrates
The code review agent had flagged a race condition in a connection pool—a real bug, well-analyzed, with a correct fix. I was impressed enough to ask it to review its own fix. It found three new issues. One was legitimate (a missing error handler). The other two were phantom: it declared a thread-safe data structure "not thread-safe" and a standard library function "missing in this Python version." I applied the legitimate fix and ignored the phantoms, then asked it to review again. It surfaced two more phantom issues and reintroduced the thread-safety concern it had already raised. By iteration four, the review was worse than the original.