appendix-f

appendix F  Common approaches to model evaluation

 

F.1 Understanding the main evaluation methods for LLMs

There are four common ways of evaluating trained LLMs in practice: multiple choice, verifiers, leaderboards, and LLM judges, as shown in figure F.1. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories.

Figure F.1 A model of the topics covered in this book, with a focus on the two broad evaluation categories covered in this appendix: benchmark-based evaluation and judgment-based evaluation

As you can see in figure F.1, the four evaluation categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation. Other measures, such as training loss, perplexity, and rewards, are typically used internally during model development. They are covered in the model-training chapters (chapters 6 to 8).

F.2 Evaluating answer-choice accuracy

Historically, one of the most widely used evaluation methods has been multiple-­choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu).

F.2.1 Loading the model

F.2.2 Checking the generated answer letter

F.3 Using verifiers to check answers

F.4 Comparing models using preferences and leaderboards

F.5 Judging responses with other LLMs

F.5.1 Implementing the LLM-as-a-judge approach in Ollama

F.5.2 Evaluating responses with a grading rubric