chapter seven

7 Reflection and evaluation: How your agent audits and learns from itself

This chapter covers

Diagnosing why intrinsic self-correction fails and what external signals make it work
Refining current outputs through tool-grounded Generator-Critic loops
Extracting reusable capabilities with Skill Package and the SKILL.md ecosystem
Learning across sessions through Experience Replay at three abstraction levels
Recovering from deterministic failures through Self-Heal Loops that close the test-driven feedback cycle
Evaluating agent systems with the five-layer evaluation stack that turns scattered metrics into a methodology

"The unexamined life is not worth living."

— Socrates

The code review agent had flagged a race condition in a connection pool—a real bug, well-analyzed, with a correct fix. I was impressed enough to ask it to review its own fix. It found three new issues. One was legitimate (a missing error handler). The other two were phantom: it declared a thread-safe data structure "not thread-safe" and a standard library function "missing in this Python version." I applied the legitimate fix and ignored the phantoms, then asked it to review again. It surfaced two more phantom issues and reintroduced the thread-safety concern it had already raised. By iteration four, the review was worse than the original.

7.1 What is reflection? How agents evaluate, critique, and improve their own work

7.1.1 The self-correction paradox

7.1.2 Testing and observing reflection

7.2 Pattern: Generator-Critic

7.2.1 The generate-critique-revise loop

7.2.2 In Production: Claude Code's edit-test-fix loop

7.2.3 Building it

7.2.4 When it breaks

7.3 Pattern: Skill Package

7.3.1 From execution to reusable skill

7.3.2 In Production: Claude Code's SKILL.md ecosystem

7.3.3 Building it

7.3.4 When it breaks

7.4 Pattern: Experience Replay

7.4.1 Three levels of experience abstraction

7.4.2 In Production: Claude Code's auto-memory and failure journals

7.4.3 Building it

7.4.4 When it breaks

7.5 Pattern: Self-Heal Loop

7.5.1 The fail-diagnose-fix-verify loop

7.5.2 In Production: Spotify Honk's TDD-grounded loop at scale

7.5.3 Building it

7.5.4 When it breaks

7.6 Composing reflection patterns

7.7 The agent evaluation stack

7.7.1 Layer 1: Deterministic verifiers