3 Definitions & background
This chapter presents the definitions, symbols, and operations used frequently throughout the RLHF process, along with an overview of language models, the most common optimization target of this book.
3.1 Language Modeling Overview
The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression means that each prediction depends on the previous tokens in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:
\[P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_t \mid x_{1}, \ldots, x_{t-1}).\] To fit a model to this distribution, the standard objective is to maximize the likelihood of the training data under the current model. Equivalently, we can minimize a negative log-likelihood (NLL) loss:
\[\mathcal{L}_{\text{LM}}(\theta)=-\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\log P_{\theta}\left(x_t \mid x_{<t}\right)\right].\]
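To make this concrete, below is a minimal sketch of the NLL computation for a single sequence, written in PyTorch (a framework choice assumed here, not prescribed by the book). The function name `sequence_nll` and the tensor shapes are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Summed negative log-likelihood of one token sequence.

    logits: (T, V) tensor of next-token logits, where logits[t]
            predicts the distribution of tokens[t + 1].
    tokens: (T,) tensor of ground-truth token ids.
    """
    # Score position t's logits against the true token at position t + 1.
    log_probs = F.log_softmax(logits[:-1], dim=-1)  # (T-1, V)
    targets = tokens[1:].unsqueeze(-1)              # (T-1, 1)
    # Gather log P(x_t | x_<t) for each realized next token.
    token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
    return -token_log_probs.sum()
```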
In practice, this is implemented as a cross-entropy loss over each next-token prediction, comparing the model's predicted distribution against the true next token in the sequence; with a one-hot target, the cross-entropy reduces exactly to the NLL above.
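Deep learning frameworks provide this cross-entropy computation directly. Continuing the hypothetical sketch above, the built-in call matches the manual NLL:

```python
# Toy inputs standing in for a model's output on one sequence.
T, V = 8, 50  # hypothetical sequence length and vocabulary size
logits = torch.randn(T, V)
tokens = torch.randint(V, (T,))

# Cross-entropy over the shifted next-token predictions is the same NLL;
# reduction="sum" matches the summed loss in sequence_nll above.
nll_via_ce = F.cross_entropy(logits[:-1], tokens[1:], reduction="sum")
assert torch.allclose(nll_via_ce, sequence_nll(logits, tokens))
```

In training code the same loss is typically averaged over tokens and batches rather than summed, which only rescales the gradient.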