12 Evaluations and benchmarks

 

This chapter covers

  • Understanding the significance of benchmarking and evaluating LLMs
  • Learning different evaluation metrics, from traditional measures to newer GenAI-specific ones
  • Benchmarking model performance using comprehensive benchmarks such as HELM, HEIM, MMLU, and HellaSWAG
  • Implementing evaluation strategies that ensure continuous improvement based on evaluation insights
  • Best practices for evaluation benchmarks and key evaluation criteria to consider

Given the recent surge of interest in GenAI, and in LLMs in particular, it's crucial to approach these novel and still-uncertain technologies with caution and responsibility. Many leaderboards and research studies have shown that LLMs can match human performance on various tasks, such as taking standardized tests or creating art, which has sparked widespread enthusiasm and attention. That same novelty and uncertainty, however, demands careful handling.

The role of benchmarking LLMs in production deployment cannot be overstated. Benchmarking lets you evaluate performance, compare models, guide improvements, track technological advances, manage costs and latency, and confirm that a model handles real-world workloads efficiently. While evaluation is part of LLMOps, it is so critical to ensuring that LLMs meet the demands of different applications that it warrants a separate discussion in this chapter.
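
As a small preview of the traditional metric evaluation covered in section 12.3, the sketch below computes a token-overlap F1 score between a model output and a reference answer. This is a minimal, illustrative example in plain Python; the token_f1 function and the sample strings are hypothetical and are not tied to any specific framework discussed later.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Simplified SQuAD-style token-overlap F1: tokenize by whitespace,
    # count shared tokens, then combine precision and recall.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model output compared against a reference answer
score = token_f1("the capital of France is Paris city",
                 "Paris is the capital of France")
print(round(score, 2))   # a value between 0.0 and 1.0; higher means more token overlap

Even a simple lexical metric like this exposes the limits of traditional scoring: a correct paraphrase that uses different words can score poorly, which is one reason this chapter also covers LLM-based metrics and task-specific benchmarks.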

12.1 LLM Evaluations

12.2 Traditional Evaluation Metrics

12.3 Example – Traditional Metric Evaluation

12.4 LLM Task-Specific Benchmarks

12.4.1 G-Eval: A Measurement Approach for NLG Evaluation

12.4.2 Example – LLM-based Evaluation Metrics

12.4.3 HELM

12.4.4 HEIM

12.4.5 HellaSWAG

12.4.6 Massive Multitask Language Understanding (MMLU)

12.4.7 Using Azure AI Studio for Evaluations

12.4.8 DeepEval – An LLM Evaluation Framework

12.5 New Evaluation Benchmarks

12.5.1 SWE-Bench

12.5.2 MMMU

12.5.3 MoCa

12.5.4 HaluEval

12.6 Human Evaluation

12.7 Summary

12.8 References