chapter eight

8 Distilling reasoning models for efficient reasoning

This chapter covers

Hard and soft distillation for reasoning models
Creating and preparing a teacher-generated reasoning dataset
Training and evaluating a distilled student model using cross-entropy loss

Reasoning performance can be improved not only through inference-time scaling and reinforcement learning, but also through distillation. In distillation, a smaller student model is trained on the reasoning traces and answers generated by a larger teacher model. As shown in the overview in figure 8.1, this chapter focuses on this training-time technique.

Figure 8.1 A model of the topics covered in this book. This chapter focuses on distillation, where a smaller student model is trained on reasoning traces generated by a larger teacher model.

We’ll look at model distillation in general before discussing the individual steps in the process.

8.1 Introducing model distillation for reasoning tasks

Model distillation involves training a smaller LLM, the student, on outputs produced by a larger LLM, the teacher. For reasoning models, these outputs usually include not only the final answer but also the intermediate reasoning trace that leads to it.

8.2 Generating a dataset for reasoning distillation

8.3 Loading the MATH training dataset for distillation

8.4 Building training examples

8 Distilling reasoning models for efficient reasoning

This chapter covers

Figure 8.1 A model of the topics covered in this book. This chapter focuses on distillation, where a smaller student model is trained on reasoning traces generated by a larger teacher model.

8.1 Introducing model distillation for reasoning tasks

8.2 Generating a dataset for reasoning distillation

8.3 Loading the MATH training dataset for distillation

8.4 Building training examples

8.4.1 Loading and understanding the tokenizer

8.4.2 Formatting and tokenizing the dataset

8.4.3 Filtering and splitting the dataset

8.5 Loading a pretrained model

8.6 Computing the training and validation losses

8.7 Implementing the training loop for distillation

8.8 Evaluating the distilled model

8.9 Future directions for reasoning models

8.10 Conclusions

8.10.1 What’s next

8.10.2 Staying up to date in a fast-moving field

Summary