3 Evaluating reasoning models

 

This chapter covers

  • Extracting final answers reliably from an LLM response
  • Verifying answer correctness by comparing an LLM's output to the reference solution using a symbolic math solver
  • Running a full evaluation pipeline by loading a pre-trained model, generating outputs, and grading them against a dataset

Evaluation is what lets us distinguish between LLMs that merely sound convincing and those that can solve problems correctly. LLM evaluation techniques span a broad range of approaches, from measuring task accuracy to verifying that LLMs adhere to specific safety standards.

In this chapter, we focus on implementing a verification-based method that checks whether an LLM solves math problems correctly by comparing its answers against reference solutions with a calculator-like implementation: a symbolic math solver.
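
To make the idea concrete, the following sketch shows how such a check could look, assuming SymPy as the symbolic math solver; the function name is_equivalent and the error handling are illustrative assumptions, not the exact implementation we build later in this chapter.

from sympy import simplify, sympify

def is_equivalent(model_answer, reference_answer):
    """Return True if two answer strings are mathematically equal."""
    try:
        # Parse both answer strings into symbolic expressions
        model_expr = sympify(model_answer)
        reference_expr = sympify(reference_answer)
        # Equivalent expressions have a difference that simplifies to zero
        return simplify(model_expr - reference_expr) == 0
    except (ValueError, TypeError):
        # Treat unparsable answers as incorrect
        return False

print(is_equivalent("1/2", "0.5"))            # True
print(is_equivalent("2*(x + 1)", "2*x + 2"))  # True
print(is_equivalent("3", "4"))                # False

Comparing the simplified difference to zero, rather than comparing raw strings, is what lets a verifier accept answers that are written differently but mean the same thing.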

This verifier is particularly useful because it not only evaluates performance on math tasks but also introduces the principle of verifiable rewards, which is the foundation of the reinforcement learning approach to reasoning models that we will implement later in chapter 5. (Interested readers can find additional evaluation methods in appendix F.)
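
As a brief preview of that connection, the sketch below turns the verifier into a binary reward signal; the reward_fn name is a hypothetical placeholder, and the actual reinforcement learning setup is developed in chapter 5.

def reward_fn(model_answer, reference_answer):
    # Reward 1.0 if the verifier confirms the answer, 0.0 otherwise
    # (builds on the is_equivalent sketch shown earlier)
    return 1.0 if is_equivalent(model_answer, reference_answer) else 0.0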

Figure 3.1 A mental model of the topics covered in this book. This chapter covers evaluation methods (stage 2), with a special focus on implementing verifiers.

3.1 Building a math verifier

3.2 Loading a pre-trained model to generate text

3.3 Implementing a wrapper for easier text generation

3.4 Extracting the final answer box

3.5 Normalizing the extracted answer

3.6 Verifying mathematical equivalence

3.7 Grading answers

3.8 Loading the evaluation dataset

3.9 Evaluating the model

3.10 Summary