
6 Training reasoning models with reinforcement learning

 

This chapter covers

  • Training reasoning LLMs as a reinforcement learning problem with task-correctness rewards
  • Sampling multiple responses per prompt to compute group-relative learning signals (see the short numeric sketch after this list)
  • Updating the LLM weights using group-based policy optimization for improved reasoning
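
To make the second and third points concrete before diving in, here is a tiny numeric sketch (an illustration, not code from this chapter's implementation): for a single prompt, we assume a group of six sampled answers with correctness rewards of 1.0 or 0.0 and standardize each reward against the group's own mean and standard deviation, which is the group-relative signal GRPO uses.

```python
import torch

# Toy rewards for one prompt: 1.0 if a sampled answer was judged correct, 0.0 otherwise.
# The specific values are made up for illustration.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # group of 6 rollouts

# Group-relative advantage: standardize each reward against its own group, so
# above-average answers get a positive learning signal and below-average ones a negative one.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # tensor([ 0.9129, -0.9129, -0.9129,  0.9129,  0.9129, -0.9129])
```

Answers that beat their group's average are reinforced and the rest are discouraged; sections 6.10 and 6.11 build this signal step by step from real model rollouts.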

Reasoning performance and answer accuracy can be improved both by increasing the inference compute budget and by applying specific training methods. This chapter, as shown in figure 6.1, focuses on reinforcement learning, which is the most commonly used training method for reasoning models.

Figure 6.1 A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning with additional training (stage 4). Specifically, this chapter covers reinforcement learning.

The next section provides a general introduction to reinforcement learning in the context of LLMs before discussing the two reinforcement learning approaches most commonly used to train them.

6.1 Introduction to reinforcement learning for LLMs

Inference-time scaling and training-time scaling are two distinct approaches for improving the reasoning performance of large language models, as illustrated in figure 6.2. Inference-time scaling increases accuracy by spending more computation per generated answer, whereas training-time scaling improves accuracy by investing additional computation during training. This chapter focuses on training-time scaling.
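
To make this contrast concrete, here is a small self-contained toy simulation (an illustrative sketch, not code from the book). A model is reduced to a probability p of answering a prompt correctly: inference-time scaling is mimicked by sampling several answers per prompt and taking a majority vote, and training-time scaling is mimicked by simply raising p, as additional training would.

```python
import random
from collections import Counter

random.seed(0)

def sample_answer(p_correct: float) -> str:
    """Toy stand-in for sampling one answer from a model."""
    if random.random() < p_correct:
        return "correct"
    return random.choice(["wrong_a", "wrong_b"])  # incorrect answers spread over two options

def single_sample_accuracy(p_correct: float, trials: int = 5000) -> float:
    """One answer per prompt: no extra inference-time compute."""
    return sum(sample_answer(p_correct) == "correct" for _ in range(trials)) / trials

def majority_vote_accuracy(p_correct: float, k: int, trials: int = 5000) -> float:
    """Inference-time scaling: sample k answers per prompt and keep the most frequent one."""
    hits = 0
    for _ in range(trials):
        votes = Counter(sample_answer(p_correct) for _ in range(k))
        hits += votes.most_common(1)[0][0] == "correct"  # ties broken arbitrarily
    return hits / trials

print(f"1 sample,  p=0.4: {single_sample_accuracy(0.4):.2f}")      # baseline model
print(f"9 samples, p=0.4: {majority_vote_accuracy(0.4, 9):.2f}")   # more compute per answer
print(f"1 sample,  p=0.6: {single_sample_accuracy(0.6):.2f}")      # 'better-trained' model
```

Both knobs raise accuracy, but they spend the extra compute at different points: per generated answer at inference time versus once, up front, during training.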

6.2 The original reinforcement learning pipeline with human feedback (RLHF)

6.3 From human feedback to verifiable rewards (RLVR)

6.4 Reinforcement learning with verifiable rewards walkthrough using GRPO

6.5 High-level GRPO intuition via a chef analogy

6.6 The high-level GRPO procedure

6.7 Loading a pre-trained model

6.8 Loading a MATH training subset

6.9 Sampling rollouts

6.10 Calculating rewards

6.11 Preparing learning signals from rollouts via advantages

6.12 Scoring rollouts with sequence log-probabilities

6.13 From advantages to policy updates via the GRPO loss

6.14 Putting everything together in a single GRPO function

6.15 Implementing the GRPO training loop

6.16 Loading and evaluating saved model checkpoints

6.17 Summary