
15 Regularization


This chapter covers

  • How a KL divergence constrains the RLHF process
  • Why regularization prevents models from producing nonsensical outputs
  • Why RL generalizes better than SFT through implicit regularization
  • Other regularization techniques for training LLMs

Throughout this book we have covered many tools for modifying a model so that it learns from human preferences, verifiable rewards, and other valuable signals. These methods are powerful, and they can shift the model too far from the strong, general model produced by the previous training stage (often called the reference model). When the model learns too much from a given reward and its out-of-distribution performance drops, this is called “over-optimization” (as discussed in the previous chapter).

Throughout RLHF optimization, several regularization steps are used to prevent over-optimization of the reward model. In this context, over-optimization shows up as models that produce nonsensical text. Examples of optimization going “off the rails” include plausible-looking mathematical reasoning that arrives at wildly incorrect answers, repeated text, unexpected language switching, or excessive special characters. This chapter covers the methods used to keep the optimization of models under control.
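To preview the most common of these methods (covered in Section 15.1), here is a minimal sketch of the per-token KL penalty used in many RLHF implementations: the reward-model score is applied at the final token of a generation, and a penalty proportional to the estimated KL divergence from the reference model is subtracted at every token. The function name, argument names, and the single-sample KL estimator `log π(a|s) − log π_ref(a|s)` are illustrative simplifications, not any particular library's API.

```python
def kl_penalized_rewards(score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sketch of per-token KL-penalized rewards for RLHF.

    score           -- sequence-level score from the reward model
    policy_logprobs -- log-probs of the sampled tokens under the policy
    ref_logprobs    -- log-probs of the same tokens under the reference model
    beta            -- strength of the KL penalty (a key hyperparameter)
    """
    rewards = []
    n = len(policy_logprobs)
    for t in range(n):
        # Single-sample estimate of the per-token KL divergence.
        kl = policy_logprobs[t] - ref_logprobs[t]
        r = -beta * kl
        if t == n - 1:
            # The reward model's score is attributed to the final token.
            r += score
        rewards.append(r)
    return rewards
```

Raising `beta` keeps the policy closer to the reference model at the cost of less reward-seeking; lowering it allows larger policy shifts and, with them, more risk of over-optimization.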

15.1 KL Divergence in RL Optimization

15.1.1 Reference Model to Generations

15.1.2 Implementation Example

15.2 Implicit Regularization

15.2.1 SFT Memorizes, RL Generalizes

15.2.2 Retaining by Doing: On-Policy Data Mitigates Forgetting

15.2.3 RL’s Razor: Why Online RL Forgets Less

15.3 Other Types of Regularization

15.3.1 Pretraining Gradients

15.3.2 Margin-based Regularization

Summary