2 The dawn of automatic evaluation: BLEU and ROUGE
This chapter covers
- Understanding BLEU and ROUGE
- Understanding reusable design principles for building evaluation metrics
- Real-life applications of BLEU, ROUGE, and other similar scoring methods
This chapter explores two seminal papers that launched the field of automatic text evaluation. The first is “BLEU: a Method for Automatic Evaluation of Machine Translation” by Kishore Papineni et al., 2002. BLEU, which stands for Bilingual Evaluation Understudy, is an automatic metric that measures translation quality by comparing lexical overlaps between machine-generated translations and human reference translations. The second is “ROUGE: A Package for Automatic Evaluation of Summaries” by Chin-Yew Lin, 2004. ROUGE, Recall-Oriented Understudy for Gisting Evaluation, applies similar lexical overlap principles but emphasizes recall rather than precision, making it better suited to evaluating text summarization.
Definition
Lexical refers to the words or vocabulary of a language, as distinct from its grammar and construction. In AI, lexical analysis concerns individual tokens (words), their spelling, and their specific arrangement in a text.
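To make the precision-versus-recall distinction concrete, the following sketch computes a clipped unigram overlap between a candidate sentence and a reference sentence, then reports it two ways: as precision (the BLEU-style view: what fraction of the candidate's words appear in the reference?) and as recall (the ROUGE-style view: what fraction of the reference's words are covered by the candidate?). This is an illustrative simplification, not the full BLEU or ROUGE algorithms: both metrics also use longer n-grams, and BLEU adds a brevity penalty, as we will see later in the chapter. The function name unigram_overlap and the example sentences are our own placeholders.

from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Illustrative unigram precision (BLEU-style) and recall (ROUGE-style)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each shared word is credited at most as many times
    # as it appears in the other text.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall

candidate = "the cat sat on the mat"
reference = "the cat is sitting on the mat"
p, r = unigram_overlap(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.83 recall=0.71

Notice that the same overlap count (five shared words) yields different scores depending on the denominator: dividing by the candidate length rewards not saying anything wrong, while dividing by the reference length rewards covering everything the reference says. That single design choice is the core difference between a precision-oriented metric like BLEU and a recall-oriented one like ROUGE.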