2 The dawn of automatic evaluation: BLEU and ROUGE
This chapter covers
- Understanding BLEU and ROUGE
- Understanding reusable design principles for building evaluation metrics
- Real-life applications of BLEU, ROUGE, and other similar scoring methods
This chapter explores two seminal papers that launched the field of automatic text evaluation. The first is “BLEU: a Method for Automatic Evaluation of Machine Translation” by Kishore Papineni et al., 2002. BLEU, which stands for Bilingual Evaluation Understudy, is an automatic metric that measures translation quality by comparing lexical overlaps between machine-generated translations and human reference translations. The second is “ROUGE: A Package for Automatic Evaluation of Summaries” by Chin-Yew Lin, 2004. ROUGE, Recall-Oriented Understudy for Gisting Evaluation, applies similar lexical overlap principles but emphasizes recall rather than precision, making it better suited to evaluating text summarization.
Definition
Lexical refers to the words or vocabulary of a language, as distinct from its grammar and construction. In AI, lexical analysis concerns individual tokens (words), their spelling, and their specific arrangement in a text.
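To make the precision-versus-recall distinction concrete, the following sketch computes a clipped unigram overlap between a candidate sentence and a reference sentence, then reports it two ways: as precision (the BLEU-style view: what fraction of the candidate's words appear in the reference?) and as recall (the ROUGE-style view: what fraction of the reference's words are covered by the candidate?). This is an illustrative simplification, not the full BLEU or ROUGE algorithms: both metrics also use longer n-grams, and BLEU adds a brevity penalty, as we will see later in the chapter. The function name unigram_overlap and the example sentences are our own placeholders.

from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Illustrative unigram precision (BLEU-style) and recall (ROUGE-style)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each shared word is credited at most as many times
    # as it appears in the other text.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall

candidate = "the cat sat on the mat"
reference = "the cat is sitting on the mat"
p, r = unigram_overlap(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.83 recall=0.71

Notice that the same overlap count (five shared words) yields different scores depending on the denominator: dividing by the candidate length rewards not saying anything wrong, while dividing by the reference length rewards covering everything the reference says. That single design choice is the core difference between a precision-oriented metric like BLEU and a recall-oriented one like ROUGE.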