12 Evaluations and benchmarks

 

This chapter covers

  • Understanding why benchmarking and evaluating LLMs matters
  • Learning evaluation metrics, from traditional measures to newer GenAI-specific ones
  • Benchmarking model performance with comprehensive suites such as HELM, HEIM, MMLU, and HellaSwag
  • Implementing comprehensive evaluation strategies that drive continuous improvement
  • Applying best practices and key criteria for evaluation benchmarks

Given the recent surge of interest in GenAI, and in LLMs in particular, it's crucial to approach these novel and still-evolving technologies with caution and responsibility. Leaderboards and research studies have shown that LLMs can match human performance on tasks ranging from standardized tests to creating art, which has sparked enormous enthusiasm. That very novelty and uncertainty, however, make careful, systematic evaluation essential.

The role of benchmarking LLMs in production deployments cannot be overstated. Benchmarks let you measure performance, compare models, guide improvements, control cost and latency, and confirm that a model can handle real-world workloads efficiently. Although evaluation is one part of LLMOps, it is so critical to ensuring that LLMs meet the demands of production applications that it warrants a chapter of its own.

12.1 LLM Evaluations

 
 
 

12.2 Traditional Evaluation Metrics
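
Traditional metrics score generated text by its surface overlap with one or more reference texts. As a concrete reference point, BLEU, one of the most widely used of these metrics, combines modified n-gram precisions p_n with a brevity penalty BP that punishes candidates shorter than the reference (the formula below is the standard definition from the literature):

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]

where c is the candidate length, r is the reference length, and the weights w_n are typically uniform (w_n = 1/N with N = 4).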

 

12.3 Example – Traditional Metric Evaluation
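
The listing below is a minimal sketch of computing two traditional metrics in Python. It assumes the third-party nltk and rouge-score packages are installed; the reference and candidate sentences are illustrative only.

# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."        # illustrative ground truth
candidate = "A cat was sitting on the mat."  # illustrative model output

# BLEU: modified n-gram precision of the candidate against the
# reference, smoothed so short sentences do not collapse to zero.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: unigram (rouge1) and longest-common-subsequence (rougeL)
# overlap, each reported as precision, recall, and F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:       {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")

Both metrics reward lexical overlap rather than meaning, which is precisely the weakness that motivates the LLM-based metrics later in this chapter.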

 

12.4 LLM Task-Specific Benchmarks

 
 
 

12.4.1 G-Eval: A Measurement Approach for NLG Evaluation

 

12.4.2 Example – LLM-Based Evaluation Metrics
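
The following is a minimal LLM-as-judge sketch using the openai Python package. The model name, the 1-to-5 coherence rubric, and the judge_coherence helper are illustrative assumptions, not a fixed recipe.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
You are evaluating a summary for coherence on a scale of 1 (incoherent)
to 5 (perfectly coherent). Think step by step, then give only the final
integer score on the last line.

Source document:
{source}

Summary:
{summary}
"""

def judge_coherence(source: str, summary: str) -> str:
    # temperature=0 keeps the judge as deterministic as the API allows
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; substitute your own
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary),
        }],
    )
    return response.choices[0].message.content

G-Eval-style scorers refine this pattern by having the LLM first generate its own evaluation steps and then weighting candidate scores by their token probabilities, but the prompt-then-score loop above is the common core.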

 
 

12.4.3 HELM

 

12.4.4 HEIM

 

12.4.5 HellaSwag

 

12.4.6 Massive Multitask Language Understanding (MMLU)
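
MMLU poses four-choice questions across 57 subjects and reports simple accuracy on the model's chosen letter. Below is a stripped-down sketch of that scoring loop; ask_model is a hypothetical stand-in for whatever LLM call you use.

LETTERS = "ABCD"

def format_question(question: str, choices: list[str]) -> str:
    # MMLU-style prompt: the question, lettered options, then "Answer:"
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def mmlu_accuracy(examples: list[dict], ask_model) -> float:
    correct = 0
    for example in examples:
        prompt = format_question(example["question"], example["choices"])
        # keep only the first character of the model's reply as its choice
        prediction = ask_model(prompt).strip()[:1].upper()
        correct += prediction == example["answer"]  # answer is "A".."D"
    return correct / len(examples)

The real benchmark additionally prepends five worked examples from the same subject (5-shot prompting) before each question.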

 
 

12.4.7 Using Azure AI Studio for Evaluations

 
 
 

12.4.8 DeepEval – An LLM Evaluation Framework
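
Here is a minimal DeepEval sketch. It assumes the deepeval package is installed and an OPENAI_API_KEY is set, since DeepEval's built-in metrics call an LLM judge under the hood; the example inputs and the 0.7 threshold are illustrative.

# pip install deepeval
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the user's question and the model's actual answer.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)

# Passes only if the LLM-judged relevancy score is at least 0.7.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test cases and prints a pass/fail report.
evaluate([test_case], [metric])

DeepEval also integrates with pytest through assert_test, so checks like this can run in CI alongside ordinary unit tests.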

 
 
 
 

12.5 New Evaluation Benchmarks

 
 
 

12.5.1 SWE-bench

 

12.5.2 MMMU

 
 

12.5.3 MoCa

 

12.5.4 HaluEval

 
 

12.6 Human Evaluation

 
 
 

12.7 Summary

 
 
 

12.8 References

 
 
 
 