chapter seven

7 Reinforcement learning: From policy gradients to GRPO

This chapter covers

Reinforcement learning (RL) for LLMs: rewards, baselines, PPO, GRPO, and the DeepSeek-R1 models
Why PPO needs policy, value, reward, and reference models
PyTorch GRPO walkthrough with verifier rewards and training steps

In the previous chapters, we built the architectural backbone of a DeepSeek-style model from scratch. We implemented Multi-Head Latent Attention, Decoupled RoPE, Mixture-of-Experts routing, Multi-Token Prediction, and the training pipeline that turns those components into a capable base model. That base model can predict text, but prediction alone is not the same as deliberate problem solving.

This chapter moves into post-training: the stage where a pretrained language model is pushed toward behaviors that are useful for reasoning. The central tool is reinforcement learning. Instead of showing the model only the next token in a dataset, reinforcement learning lets the model generate complete responses and then assigns a reward to each one. The training update then makes rewarded responses more likely the next time the model samples.
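To make the try-score-update loop concrete before the formal treatment, here is a minimal sketch in plain Python. It is a toy, not the chapter's GRPO code: the "model" is just three logits over three hypothetical candidate responses, the reward function (1.0 for response 2, otherwise 0.0) is an assumption chosen for illustration, and the helper names `softmax`, `reinforce_step`, and `train` are invented here. The update is the basic REINFORCE rule, which nudges each logit by the reward times the gradient of the log-probability of the sampled response.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, lr, rng):
    """Sample one response, score it, and return updated logits."""
    probs = softmax(logits)
    # Sample instead of taking the argmax, so every response gets tried.
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    # For a softmax policy, d log pi(action) / d logit_j = 1[j == action] - probs[j].
    return [
        logit + lr * reward * ((1.0 if j == action else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


def train(num_steps=500, lr=0.5, seed=0):
    """Run the toy loop and return the final response probabilities."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # start with a uniform policy

    def reward_fn(action):
        # Hypothetical reward: only response 2 counts as correct.
        return 1.0 if action == 2 else 0.0

    for _ in range(num_steps):
        logits = reinforce_step(logits, reward_fn, lr, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Because only response 2 ever earns a nonzero reward, every effective update raises its logit and lowers the others, so its probability climbs from one third toward one. A real LLM replaces the three logits with a neural network and the single action with a sequence of tokens, but the loop has the same shape.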

7.1 The reinforcement learning framework

7.1.1 The agent-environment interface

7.1.2 Mapping reinforcement learning to language generation

7.1.3 Sparse rewards and credit assignment

7.2 Policy-gradient methods: Updating the LLM with rewards

7.3 Sampling actions instead of taking argmax

7.3.1 The basic policy-gradient update

7.3.2 Why raw rewards are too noisy

7.4 PPO: The standard practical baseline

7.4.1 PPO components for LLM training

7.4.2 Where the value function comes from

7.4.3 Why the reference model matters

7.5 GRPO: DeepSeek's value model simplification

7.5.1 Group-relative advantages

7.5.2 The full GRPO objective

7.5.3 Why GRPO is cheaper than PPO

7.6 Reinforcement learning with verifiable rewards

7.6.1 Designing a verifier

7.6.2 Avoiding overstatement

7.7 How reasoning emerges in DeepSeek-R1-Zero

7.8 DeepSeek-R1, R1-Zero, and Distill

7.8.1 From R1-Zero to R1

7.8.2 Why cold-start data matters

7.8.3 Rejection sampling as data construction

7.8.4 Restoring general assistant behavior

7.8.5 The final hybrid reward stage

7.8.6 Distillation into smaller models

7.9 Minimal chapter code

7.9.1 A minimal verifier

7.9.2 Computing group-relative advantages

7.9.3 The clipped GRPO loss