chapter nine

9 LLM-as-a-judge fundamentals

 

This chapter covers

  • Defining fundamentals required for successful LLM-as-a-judge evaluations
  • Illustrating how deterministic metrics break down
  • Detailing key engineering considerations for implementing LLM-as-a-judge systems
  • Deciding when not to use LLMs for evaluations

LLM-as-a-judge refers to an evaluation approach in which a large language model (LLM) is used to assess, score, or rank the outputs of another model (or multiple models) against defined criteria. Although LLMs are now commonly used to power product features such as chat interfaces, recommendation explanations, and content generation, the model’s role in this chapter is fundamentally different. Here the LLM is not generating product output. Instead, it operates as an evaluator, offering a scalable, programmatic way to assess outputs in scenarios where deterministic metrics fail to capture quality, nuance, or intent.

LLM-as-a-judge extends classic evaluation methods by handling loosely specified, more subjective tasks at scale while retaining richer contextual information. But none of this matters unless your goals, learning intent, and product dimensions are crystal clear.
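To make the evaluative role concrete, here is a minimal sketch of a judge that scores one response against one criterion. Everything in it is an illustrative assumption rather than a prescribed implementation: the prompt wording, the 1-5 scale, the JSON verdict format, and the names `judge`, `JUDGE_PROMPT`, and `stub_model`. The stub stands in for a real LLM call so the example runs without any API.

```python
import json
from typing import Callable

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_PROMPT = """You are an evaluator. Score the response against the criterion.
Criterion: {criterion}
User request: {request}
Response to evaluate: {response}
Reply with JSON only: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""


def judge(call_model: Callable[[str], str], criterion: str,
          request: str, response: str) -> dict:
    """Ask a judge model to score one response and validate its verdict."""
    prompt = JUDGE_PROMPT.format(
        criterion=criterion, request=request, response=response
    )
    raw = call_model(prompt)
    verdict = json.loads(raw)  # malformed JSON raises a ValueError subclass
    score = verdict["score"]
    # Reject anything that is not a plain integer inside the agreed scale.
    if type(score) is not int or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return {"score": score, "reason": str(verdict.get("reason", ""))}


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a fixed verdict."""
    return '{"score": 4, "reason": "Relevant, but the explanation is thin."}'


result = judge(
    stub_model,
    criterion="The recommendation matches the user's stated taste.",
    request="Suggest a thoughtful science fiction movie.",
    response="Try Arrival; it is a slow, character-driven first-contact story.",
)
print(result)
```

Passing the model in as a plain callable keeps the judging logic separate from any particular provider, and validating the verdict before returning it means a malformed or out-of-range judgment fails loudly instead of silently entering your metrics.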

9.1 The case for LLM-as-a-judge

9.2 When deterministic metrics aren’t enough

9.3 What LLM judges are good at and what they are not

9.4 Getting the goal and data right

9.4.1 Defining the goal

9.4.2 Designing the data

9.5 How to design a good LLM-as-a-judge evaluation

9.5.1 Defining the judging task

9.5.2 Designing the prompt

9.5.3 Choosing the right judge format: scoring, ranking, or pairwise comparison

9.5.4 Selecting the right evaluator model

9.6 Prompt evaluation feedback loop

9.6.1 LLM-as-a-judge for the movie recommender model

9.7 Common LLM judge failure modes

9.7.1 Position bias

9.7.2 Verbosity bias

9.7.3 Style and self-preference bias

9.7.4 Context insufficiency

9.7.5 Prompt injection against the judge

9.7.6 Overconfidence and fluent explanations

9.8 Validating an LLM-as-a-judge evaluation

9.9 When not to use LLM-as-a-judge

9.10 Engineering considerations

9.10.1 Prompt debugging 101

9.10.2 Document everything, just as you would any other code base

9.10.3 Calibrate before increasing scale

9.10.4 Observability isn’t optional; it’s a must

9.10.5 Engineering factors influencing LLM model selection

9.11 Summary