
Part 3: LLM-as-a-judge evaluations

The next few chapters introduce a new frontier in model evaluation: using LLMs themselves as evaluators. You might be thinking: we’re using models to evaluate other models? Yes. Yes we are.

And there’s a good reason for it. LLM-based evaluations scale far beyond what human spot-checking can handle. They can process thousands of outputs, compare multiple model variants, and identify subtle qualitative differences that traditional metrics often miss. Unlike rule-based metrics, LLM-as-a-judge evaluations, when implemented carefully, can assess open-ended, subjective qualities: clarity, coherence, tone, safety, or helpfulness.

In Chapter 9, we’ll start with the fundamentals: defining what makes a good LLM-as-a-judge setup, understanding prompt design, and establishing the engineering foundations, like versioning, calibration, and cost control. These foundations are often forgotten because it’s so easy to get started with LLMs as evaluators, but much harder to scale and maintain them.

In Chapter 10, we’ll get more technical, exploring design patterns, prompting strategies, and practical workflows through hands-on notebooks. You’ll see how these evaluations can be integrated into real pipelines to guide model iteration, accelerate experiments, and strengthen decision-making.

This part of the book is about evolving beyond static metrics toward evaluations that reason, not just measure.