7 Reinforcement learning: From policy gradients to GRPO
This chapter covers
- RL for LLMs: PPO, rewards, baselines, GRPO, R1 models
- Why PPO needs policy, value, reward, and reference models
- PyTorch GRPO walkthrough with verifier rewards and training steps
In the previous chapters, we built the architectural backbone of a DeepSeek-style model from scratch. We implemented Multi-Head Latent Attention, Decoupled RoPE, Mixture-of-Experts routing, Multi-Token Prediction, and the training pipeline that turns those components into a capable base model. That base model can predict text, but prediction alone is not the same as deliberate problem solving.
This chapter moves into post-training: the stage where a pretrained language model is pushed toward behaviors that are useful for reasoning. The central tool is reinforcement learning. Instead of training the model only to predict the next token in a dataset, reinforcement learning lets the model generate complete responses and then assigns a reward to each response. The training update then makes rewarded behavior more likely the next time the model generates.
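To make the try, reward, update loop concrete before the formal treatment, here is a toy sketch in plain Python. It is not the chapter's PyTorch code and not how DeepSeek trains its models. The "policy" is just two logits over two hypothetical canned responses, the reward table is invented for illustration, and the update is the simplest policy-gradient rule with no baseline, no clipping, and no reference model. The names `softmax` and `train_toy_policy` are assumptions made up for this example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(steps=200, lr=0.5, seed=0):
    """Sample a response, score it, and nudge the policy toward rewarded samples."""
    rng = random.Random(seed)
    # Hypothetical candidate responses and their rewards (illustrative only).
    responses = ["2 + 2 = 5", "2 + 2 = 4"]
    rewards = [0.0, 1.0]
    logits = [0.0, 0.0]  # the policy starts out indifferent

    for _ in range(steps):
        probs = softmax(logits)
        # Try: sample a complete response from the current policy.
        i = rng.choices(range(len(responses)), weights=probs)[0]
        # Reward: score the sampled response.
        r = rewards[i]
        # Update: the gradient of log pi(i) with respect to logit j
        # is (1 if j == i else 0) - probs[j]; scale it by the reward.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * r * grad

    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_toy_policy()
    print("P(wrong answer), P(right answer):", final_probs)
```

Because only the correct response earns a nonzero reward here, every update that happens pushes probability toward it. Real LLM training replaces the two logits with a full network, the reward table with a verifier or reward model, and this bare update with the stabilized objectives developed later in the chapter.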
7.1 The reinforcement learning framework
7.1.1 The agent-environment interface
7.1.2 Mapping reinforcement learning to language generation
7.1.3 Sparse rewards and credit assignment
7.2 Policy-gradient methods: Updating the LLM with rewards
7.3 Sampling actions instead of taking argmax
7.3.1 The basic policy-gradient update
7.3.2 Why raw rewards are too noisy
7.4 PPO: The standard practical baseline
7.4.1 PPO components for LLM training
7.4.2 Where the value function comes from
7.4.3 Why the reference model matters
7.5 GRPO: DeepSeek's value model simplification
7.5.1 Group-relative advantages
7.5.2 The full GRPO objective
7.5.3 Why GRPO is cheaper than PPO
7.6 Reinforcement learning with verifiable rewards
7.6.1 Designing a verifier
7.6.2 Avoiding overstatement
7.7 How reasoning emerges in DeepSeek-R1-Zero
7.8 DeepSeek-R1, R1-Zero, and Distill
7.8.1 From R1-Zero to R1
7.8.2 Why cold-start data matters
7.8.3 Rejection sampling as data construction
7.8.4 Restoring general assistant behavior
7.8.5 The final hybrid reward stage
7.8.6 Distillation into smaller models
7.9 Minimal chapter code
7.9.1 A minimal verifier
7.9.2 Computing group-relative advantages
7.9.3 The clipped GRPO loss