chapter four

4 LLM-as-a-judge: The new paradigm for evaluation

This chapter covers

  • Measuring evaluator quality with correlation coefficients and agreement metrics
  • G-Eval's framework for systematic LLM-based evaluation using chain-of-thought
  • LLM-as-a-judge biases and limitations
  • Practical guidance for implementing and calibrating LLM-based evaluation systems

Consider this scenario: You've deployed a customer service chatbot and need to evaluate thousands of responses daily. Human annotators are expensive and slow. BLEU scores tell you little about open-ended dialogue, where many different responses can be equally good. BERTScore tells you about semantic similarity, but not whether the response actually helps the customer. COMET requires reference translations you don't have. What if you could ask GPT-4 to judge whether each response was helpful, accurate, and appropriate in tone, and to explain its reasoning?

In chapter 3, we traced the evolution from designed metrics to learned metrics. COMET represented a paradigm shift: rather than encoding human intuitions into formulas, it learned evaluation directly from human judgments. But COMET still outputs a black-box score. It cannot tell you why a translation scored 0.72 or what specific errors it detected.

4.1 Measuring evaluator agreement

4.1.1 Why correlation matters for evaluation metrics

4.1.2 Spearman’s rank correlation

4.1.3 Kendall’s tau

4.1.4 Cohen’s kappa for categorical agreement

4.1.5 Practical considerations

4.2 G-Eval: Systematic LLM-based evaluation

4.2.1 From traditional metrics to LLM-as-a-judge

4.2.2 The G-Eval prompting framework

4.2.3 Probability-weighted scoring

4.2.4 Chain-of-thought for evaluation

4.2.5 G-Eval summary

4.3 Judging LLM-as-a-judge

4.3.1 Position bias

4.3.2 Verbosity bias

4.3.3 Self-enhancement bias

4.3.4 Limited reasoning ability

4.3.5 Biases and mitigation strategies summary

4.4 Implementing LLM-as-a-judge

4.4.1 Online versus offline evaluation

4.4.2 Choosing a judge type

4.4.3 Designing evaluation rubrics

4.4.4 Calibration and monitoring

4.4.5 When to use LLM judges

4.5 Summary