3 Bridging the semantic gap with learned metrics: BERTScore and COMET
This chapter covers
- Why lexical metrics fail to capture semantic equivalences and paraphrasing
- How BERTScore uses contextual embeddings to measure semantic similarity
- COMET’s approach to learning evaluation functions directly from human judgments
- Practical guidelines for choosing between lexical and semantic metrics
Now we’ll explore two papers that transformed automatic evaluation from lexical matching to semantic understanding. The first is “BERTScore: Evaluating Text Generation with BERT” by Zhang et al., 2020. BERTScore uses contextual embeddings from pretrained language models to measure semantic similarity between generated text and references, recognizing that “attorney” and “lawyer” convey the same meaning even though they are entirely different strings. The second is “COMET: A Neural Framework for MT Evaluation” by Rei et al., 2020. COMET (Crosslingual Optimized Metric for Evaluation of Translation) takes a fundamentally different approach: rather than designing a similarity formula, it learns an evaluation function directly from human quality judgments.
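To make the contrast concrete, here is a minimal sketch of BERTScore’s core computation, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the official bert_score package adds refinements such as IDF weighting, baseline rescaling, and special-token filtering that this sketch omits:

```python
# Minimal sketch of BERTScore's idea: embed candidate and reference with a
# pretrained encoder, compute pairwise cosine similarities between tokens,
# and greedily match each token to its best counterpart.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings, one per token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate: str, reference: str) -> float:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T                        # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()     # best candidate match per reference token
    return (2 * precision * recall / (precision + recall)).item()
```

COMET, by contrast, is not a formula you implement but a trained model you load: it was fit to human quality judgments over triples of source sentence, machine translation, and reference. A scoring call with the unbabel-comet package might look like the following sketch; the checkpoint name and the German source sentence are illustrative, not prescribed by the paper:

```python
# Sketch of scoring with a learned COMET checkpoint via the unbabel-comet
# package. Unlike BERTScore, COMET also conditions on the source sentence,
# because it was trained to predict human judgments of translation quality.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # example checkpoint
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Die Leute mögen ausländische Autos.",  # hypothetical source
        "mt": "Consumers prefer imported cars.",
        "ref": "People like foreign cars.",
    }
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one learned quality score per segment
```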
Consider evaluating two AI-generated responses against the reference “People like foreign cars.”:
- Candidate 1: “People like visiting places abroad.”
- Candidate 2: “Consumers prefer imported cars.”
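A lexical metric such as BLEU rewards candidate 1 for the overlapping words “people like”, even though candidate 2 is the faithful paraphrase. A quick check with the official bert_score package (installable as bert-score) shows how a semantic metric treats the pair; exact numbers depend on the underlying model checkpoint, but candidate 2 should come out ahead:

```python
# Score both candidates against the reference with the bert_score package.
from bert_score import score

references = ["People like foreign cars."] * 2
candidates = [
    "People like visiting places abroad.",  # high word overlap, different meaning
    "Consumers prefer imported cars.",      # no word overlap, same meaning
]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
for candidate, f1 in zip(candidates, F1.tolist()):
    print(f"{f1:.3f}  {candidate}")
```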