4 LLM-as-a-judge: The new paradigm for evaluation
This chapter covers
- Measuring evaluator quality with correlation coefficients and agreement metrics
- G-Eval's framework for systematic LLM-based evaluation using chain-of-thought
- LLM-as-a-judge biases and limitations
- Practical guidance for implementing and calibrating LLM-based evaluation systems
Consider this scenario: You've deployed a customer service chatbot and need to evaluate thousands of responses daily. Human annotators are expensive and slow. BLEU scores tell you little about open-ended dialogue, where many differently worded responses can be equally good. BERTScore tells you about semantic similarity, but not whether the response actually helps the customer. COMET requires reference translations you don't have. What if you could ask GPT-4 to judge whether each response was helpful, accurate, and appropriately voiced, and to explain its reasoning?
In chapter 3, we traced the evolution from designed metrics to learned metrics. COMET represented a paradigm shift: rather than encoding human intuitions into formulas, it learned evaluation directly from human judgments. But COMET still outputs a single opaque score. It cannot tell you why a translation scored 0.72 or what specific errors it detected.