12 Evaluations and benchmarks


This chapter covers

  • Understanding the significance of benchmarking and evaluating LLMs
  • Learning different evaluation metrics
  • Benchmarking model performance
  • Implementing comprehensive evaluation strategies
  • Best practices for evaluation benchmarks and key evaluation criteria to consider

Given the recent surge of interest in generative AI (GenAI), and specifically in large language models (LLMs), it's crucial to approach these novel and still-maturing technologies cautiously and responsibly. Many leaderboards and studies have shown that LLMs can match human performance on various tasks, such as taking standardized tests or creating art, which has sparked enormous enthusiasm and attention. However, their novelty and the uncertainties surrounding them call for careful handling.

The role of benchmarking LLMs in production deployments cannot be overstated. It involves evaluating performance, comparing models, guiding improvements, accelerating technological advancement, managing cost and latency, and ensuring efficient task flow in real-world applications. Although evaluation is part of LLMOps, it is so critical to ensuring that LLMs meet the demands of different applications that it warrants a separate discussion in this chapter.

12.1 LLM evaluations

12.2 Traditional evaluation metrics

12.2.1 BLEU

12.2.2 ROUGE

12.2.3 BERTScore

12.2.4 An example of traditional metric evaluation

12.3 LLM task-specific benchmarks

12.3.1 G-Eval: A measuring approach for NLG evaluation

12.3.2 An example of LLM-based evaluation metrics

12.3.3 HELM

12.3.4 HEIM

12.3.5 HellaSWAG

12.3.6 Massive Multitask Language Understanding (MMLU)

12.3.7 Using Azure AI Studio for evaluations

12.3.8 DeepEval: An LLM evaluation framework

12.4 New evaluation benchmarks

12.4.1 SWE-bench

12.4.2 MMMU

12.4.3 MoCa

12.4.4 HaluEval

12.5 Human evaluation