chapter six

6 Training reasoning models with reinforcement learning


This chapter covers

  • The difference between reinforcement learning with human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
  • Training reasoning LLMs as a reinforcement learning problem with task-correctness rewards
  • Sampling multiple responses per prompt to compute group-relative learning signals
  • Updating the LLM weights using group-based policy optimization for improved reasoning
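The group-relative learning signal mentioned above can be sketched in a few lines: each sampled response in a group receives an advantage equal to its reward minus the group mean, scaled by the group's standard deviation. This is a minimal illustration only; the function name and the 0/1 correctness reward are placeholders, and the full procedure is developed step by step later in the chapter.

```python
import statistics

def group_relative_advantages(rewards):
    """Turn per-response rewards for one prompt into group-relative
    advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8  # avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts sampled for the same prompt, with a
# verifiable reward of 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Here the two correct responses receive positive advantages and the two incorrect ones receive negative advantages, so a subsequent policy update pushes the model toward the behavior that produced correct answers.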

Reasoning performance and answer accuracy can be improved both by increasing the inference-time compute budget and through dedicated training methods. This chapter, as shown in figure 6.1, focuses on reinforcement learning, the most commonly used training method for reasoning models.

Figure 6.1 A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning with additional training (stage 4). Specifically, this chapter covers reinforcement learning.

The next section provides a general introduction to reinforcement learning in the context of LLMs before discussing the two common reinforcement learning approaches used for LLMs.

6.1 Introduction to reinforcement learning for LLMs

6.1.1 The original reinforcement learning pipeline with human feedback (RLHF)

6.1.2 From human feedback to verifiable rewards (RLVR)

6.2 Reinforcement learning with verifiable rewards walkthrough using GRPO

6.2.1 High-level GRPO intuition via a chef analogy

6.2.2 The high-level GRPO procedure

6.3 Loading a pre-trained model

6.4 Loading a MATH training subset

6.5 Sampling rollouts

6.6 Calculating rewards

6.7 Preparing learning signals from rollouts via advantages

6.8 Scoring rollouts with sequence log-probabilities

6.9 From advantages to policy updates via the GRPO loss

6.10 Putting everything together in a single GRPO function

6.11 Implementing the GRPO training loop

6.12 Loading and evaluating saved model checkpoints

6.13 Summary