Chapter Six

6 Reinforcement Learning


This chapter covers

  • How reinforcement learning is used to optimize human preferences
  • Mathematics and intuitions for policy-gradient algorithms
  • The derivations and trade-offs of different algorithms
  • New algorithms for reasoning models, such as GRPO, GSPO, and CISPO
  • Implementation details of modern RLHF stacks

In the RLHF process, the reinforcement learning algorithm iteratively updates the model's weights based on feedback from a reward model. The policy, that is, the model being trained, generates completions to prompts from the training set; the reward model scores those completions; and the reinforcement learning optimizer takes gradient steps based on those scores (see fig. 6.1 for an overview). This chapter explains the mathematics and trade-offs of the various algorithms used to learn from the signal the reward model assigns to on-policy data (i.e., data generated by the current version of the model being trained). These algorithms run for many epochs, often thousands or millions of batches over a large set of prompts, with a gradient update after each batch. A minimal sketch of this loop appears below.
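To make the generate-score-update loop concrete before diving into specific algorithms, the following is a minimal sketch in Python with toy stand-ins. The names (policy, reward_model, the tiny linear networks, and the REINFORCE-style loss with a batch-mean baseline) are illustrative assumptions rather than the API of any particular RLHF library; sections 6.1 and 6.2 cover the real algorithms and implementation details.

import torch
import torch.nn.functional as F

# Toy stand-ins for the three components of the RLHF loop. These names and
# shapes are assumptions for illustration only, not a real training stack.
VOCAB, SEQ_LEN, BATCH = 8, 6, 4

policy = torch.nn.Linear(VOCAB, VOCAB)      # maps a one-hot "context" to next-token logits
reward_model = torch.nn.Linear(SEQ_LEN, 1)  # scores a completion (here: its token ids as floats)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):  # in practice: thousands or millions of batches
    # 1) The policy generates completions to a batch of prompts (on-policy data).
    context = F.one_hot(torch.randint(VOCAB, (BATCH,)), VOCAB).float()
    tokens, log_probs = [], []
    for _ in range(SEQ_LEN):
        logits = policy(context)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        tokens.append(tok)
        log_probs.append(dist.log_prob(tok))
        context = F.one_hot(tok, VOCAB).float()
    completion = torch.stack(tokens, dim=1).float()           # (BATCH, SEQ_LEN)
    sum_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)   # (BATCH,)

    # 2) The reward model scores the completions (no gradients flow through it).
    with torch.no_grad():
        rewards = reward_model(completion).squeeze(-1)        # (BATCH,)

    # 3) The RL optimizer takes a gradient step on a policy-gradient loss,
    #    here plain REINFORCE with a batch-mean baseline (the simplest case in 6.1).
    advantages = rewards - rewards.mean()
    loss = -(advantages * sum_log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Every algorithm in this chapter fills in step 3 differently, for example by adding clipping and a value function (PPO) or by computing advantages from groups of completions to the same prompt (GRPO and its descendants); steps 1 and 2 stay structurally the same.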

6.1 Policy Gradient Algorithms

6.1.1 Vanilla Policy Gradient

6.1.2 REINFORCE

6.1.3 REINFORCE Leave One Out (RLOO)

6.1.4 Proximal Policy Optimization (PPO)

6.1.5 Group Relative Policy Optimization (GRPO)

6.1.6 Group Sequence Policy Optimization (GSPO)

6.1.7 Clipped Importance Sampling Policy Optimization (CISPO)

6.1.8 Comparing Algorithms

6.2 Implementation

6.2.1 Policy Gradient Basics

6.2.2 Loss Aggregation

6.2.3 Asynchronicity

6.2.4 Proximal Policy Optimization

6.2.5 Group Relative Policy Optimization