3 Bridging the semantic gap with learned metrics: BERTScore and COMET

 

This chapter covers

  • Why lexical metrics fail to capture semantic equivalence and paraphrasing
  • How BERTScore uses contextual embeddings to measure semantic similarity
  • COMET’s approach to learning evaluation functions directly from human judgments
  • Practical guidelines for choosing between lexical and semantic metrics

Now we’ll explore two papers that transformed automatic evaluation from lexical matching to semantic understanding. The first is “BERTScore: Evaluating Text Generation with BERT” by Zhang et al. (2020). BERTScore uses contextual embeddings from pretrained language models to measure semantic similarity between generated text and references, recognizing that “attorney” and “lawyer” convey the same meaning even though a string-matching metric sees no overlap between them. The second is “COMET: A Neural Framework for MT Evaluation” by Rei et al. (2020). COMET (Crosslingual Optimized Metric for Evaluation of Translation) takes a fundamentally different approach: rather than designing a similarity formula, it learns an evaluation function directly from human quality judgments.
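
Both metrics are available as open source packages, so before we dig into how they work it helps to see what calling them looks like. The first sketch below scores a single candidate/reference pair with the bert_score package; it is a minimal illustration that accepts the library’s default English model rather than a recommended setup, and section 3.2 unpacks what is actually being computed.

```python
# Minimal BERTScore sketch, assuming the bert_score package is installed
# (pip install bert-score); a pretrained model is downloaded on first use.
from bert_score import score

candidates = ["The attorney reviewed the contract."]
references = ["The lawyer reviewed the contract."]

# score() returns precision, recall, and F1 tensors, one value per candidate.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")  # high, even though "attorney" != "lawyer"
```

COMET, being a learned metric, ships as a model checkpoint that you download and run. The second sketch is equally hedged: it assumes the unbabel-comet package and the Unbabel/wmt22-comet-da checkpoint, a later release than the exact model in the 2020 paper. Note that COMET expects the source sentence alongside the system output and the reference; that design choice is covered in section 3.3.

```python
# Hedged COMET sketch, assuming the unbabel-comet package is installed
# (pip install unbabel-comet); the checkpoint below is a successor to the
# model described in the original 2020 paper.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [{
    "src": "Die Leute mögen ausländische Autos.",  # source sentence
    "mt":  "Consumers prefer imported cars.",      # system output to score
    "ref": "People like foreign cars.",            # human reference
}]

# predict() returns per-segment scores and a corpus-level system score.
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores, output.system_score)
```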

Consider evaluating two AI-generated responses against the reference “people like foreign cars”:

  • Candidate 1: “People like visiting places abroad.”
  • Candidate 2: “Consumers prefer imported cars.”

Candidate 1 shares more surface words with the reference (“people like”), so a purely lexical metric rewards it, yet only candidate 2 preserves the reference’s meaning. The sketch below shows the two kinds of metrics disagreeing on exactly this pair.
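
The comparison uses sentence-level BLEU (via the sacrebleu package) as the lexical metric and BERTScore as the semantic one. It is a sketch under stated assumptions: both packages installed, library defaults accepted. The exact numbers depend on model and tokenizer choices, but the ranking disagreement is the point.

```python
# Lexical vs. semantic scoring of the two candidates above.
# Assumes: pip install sacrebleu bert-score
from sacrebleu.metrics import BLEU
from bert_score import score

reference = "people like foreign cars"
candidates = [
    "People like visiting places abroad.",  # high word overlap, wrong meaning
    "Consumers prefer imported cars.",      # low word overlap, right meaning
]

# effective_order is the recommended setting for sentence-level BLEU;
# lowercase=True lets "People" match "people".
bleu = BLEU(effective_order=True, lowercase=True)

# BERTScore compares each candidate to its reference using contextual embeddings.
_, _, f1 = score(candidates, [reference] * len(candidates), lang="en", verbose=False)

for candidate, bert_f1 in zip(candidates, f1):
    lexical = bleu.sentence_score(candidate, [reference]).score
    print(f"{candidate!r:40}  BLEU: {lexical:5.1f}  BERTScore F1: {bert_f1.item():.3f}")
```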

3.1 The limits of lexical matching

3.1.1 A taxonomy of evaluation approaches

3.2 BERTScore: Semantic similarity at scale

3.2.1 What are embeddings?

3.2.2 From words to semantics

3.2.3 The geometric intuition

3.2.4 The BERTScore algorithm

3.2.5 BERTScore summary

3.3 COMET: Learning evaluation from human preferences

3.3.1 From designed to learned metrics

3.3.2 The design of COMET

3.3.3 The COMET architecture

3.3.4 COMET summary

3.4 Semantic metrics in practice

3.4.1 Choosing the right metric

3.4.2 Limitations and failure modes

3.4.3 Combining metrics in practice

3.4.4 Towards generative evaluation

3.5 Summary