3 Training Overview

This chapter covers

  • Reinforcement learning basics
  • How RLHF relates to traditional RL
  • An outline of the RLHF tools you’ll learn in this book
  • RLHF training recipes of popular models like InstructGPT and DeepSeek R1

In this chapter we provide a cursory overview of RLHF training before getting into the specifics later in the book. While RLHF optimizes a simple loss function, it involves training multiple different AI models in sequence and then linking them together in a complex online optimization.
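
To make that sequence concrete before any of the pieces are defined, the sketch below strings the stages together at toy scale: a frozen reference policy stands in for the instruction-tuned model, a linear reward model is fit to simulated pairwise preferences, and the policy is then optimized online against that reward with a KL penalty. Every detail here (the random features, the hyperparameters, the tiny tabular policy) is an illustrative assumption, not the recipe of any model discussed in this book.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PROMPTS, N_RESPONSES, DIM = 4, 5, 8

# Fixed random features standing in for (prompt, response) representations.
features = rng.normal(size=(N_PROMPTS, N_RESPONSES, DIM))
# Hidden "annotator" preference direction, used only to simulate labels.
true_w = rng.normal(size=DIM)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1: a frozen reference policy (stand-in for the instruction-tuned model).
ref_logits = rng.normal(size=(N_PROMPTS, N_RESPONSES))
ref_probs = softmax(ref_logits)

# Stage 2: fit a linear reward model on simulated pairwise preferences
# (logistic / Bradley-Terry loss on chosen-vs-rejected pairs).
w = np.zeros(DIM)
for _ in range(2000):
    p = rng.integers(N_PROMPTS)
    a, b = rng.choice(N_RESPONSES, size=2, replace=False)
    if true_w @ features[p, b] > true_w @ features[p, a]:
        a, b = b, a                      # make `a` the chosen response
    diff = features[p, a] - features[p, b]
    margin = w @ diff
    w += 0.05 * (1.0 / (1.0 + np.exp(margin))) * diff   # ascent on log-sigmoid(margin)

reward = features @ w                    # proxy reward for every (prompt, response) pair

# Stage 3: online, KL-regularized policy optimization with REINFORCE.
beta, lr = 0.1, 0.5
logits = ref_logits.copy()               # initialize the policy at the reference model
for _ in range(2000):
    p = rng.integers(N_PROMPTS)
    probs = softmax(logits[p])
    y = rng.choice(N_RESPONSES, p=probs)            # "generate" a completion
    kl_est = np.log(probs[y] / ref_probs[p, y])     # per-sample KL estimate
    advantage = reward[p, y] - beta * kl_est        # regularized reward signal
    grad_logp = -probs
    grad_logp[y] += 1.0                             # d log pi(y|p) / d logits[p]
    logits[p] += lr * advantage * grad_logp         # REINFORCE ascent step

print("expected proxy reward, reference:", float((ref_probs * reward).sum(axis=1).mean()))
print("expected proxy reward, tuned:", float((softmax(logits) * reward).sum(axis=1).mean()))
```

The three stages mirror the structure of the rest of the book: a reference model from instruction tuning, a reward model from preference data, and a final model from the online, KL-regularized optimization loop.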

Here we introduce the core objective of RLHF, which is optimizing a proxy reward of human preferences under a distance-based regularizer, and show how it relates to classical RL problems. Then we showcase canonical recipes that use RLHF to create leading models, illustrating how RLHF fits in with the rest of post-training. These example recipes will serve as references later in the book, where we describe the different optimization choices you have when doing RLHF and point back to how key models used different steps in their training.
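
Written out, that objective is usually stated as a KL-regularized reward maximization. The notation below is a preview using standard symbols of our choosing (policy, learned reward model, reference model, and regularization weight); the full problem setup is built up in section 3.1.

```latex
% KL-regularized RLHF objective (preview; setup defined in section 3.1)
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta \, \mathbb{E}_{x \sim \mathcal{D}}
\Big[ \mathcal{D}_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```

Here the reward model is the learned proxy for human preferences, the reference model is the frozen checkpoint the policy is kept close to (typically the instruction-tuned model), and the weight on the KL term controls how strongly the policy is pulled back toward that reference.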

3.1 Problem Formulation

3.1.1 Example RL Task: CartPole

3.1.2 Manipulating the Standard RL Setup

3.1.3 Fine-tuning and Regularization

3.1.4 Optimization Tools

3.2 Canonical Training Recipes

3.2.1 InstructGPT

3.2.2 Tülu 3

3.2.3 DeepSeek R1