7 Improving GRPO for reinforcement learning
This chapter covers
- Interpreting training curves and evaluation metrics
- Preventing the model from exploiting the reward signal
- Extending task-correctness rewards with additional response-formatting rewards
Previously, we implemented the GRPO algorithm for reinforcement learning with verifiable rewards (RLVR) end to end. Now, as shown in figure 7.1, we pick up from that baseline and focus on what happens over longer training runs.
Figure 7.1 A mental model of the topics covered in this book. This chapter provides a deeper coverage of the GRPO algorithm for reinforcement learning with verifiable rewards.
In particular, we will discuss which metrics are worth tracking (beyond reward and accuracy), how to spot failure modes early, and why training can become unstable even when the code is "correct." We then introduce practical GRPO extensions and fixes used in reasoning-model training.
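To make this concrete, the following is a minimal sketch of logging a few quantities beyond reward and accuracy during a GRPO step. The function name and inputs are hypothetical placeholders, not part of the chapter's training code; they simply illustrate the kind of per-step summary (reward spread within a group, approximate KL to the reference model, and response length) that helps spot failure modes early.

```python
import torch

def compute_training_metrics(rewards, ref_logprobs, new_logprobs, response_lengths):
    """Summarize one GRPO step beyond raw reward/accuracy.
    All inputs are 1D tensors with one entry per sampled response.
    (Hypothetical helper; the names are illustrative only.)"""
    # Rough KL estimate between current and reference policy,
    # based on the mean log-probability difference of the sampled responses
    approx_kl = (new_logprobs - ref_logprobs).mean()
    return {
        "reward_mean": rewards.mean().item(),
        "reward_std": rewards.std().item(),        # spread within the sampled group
        "approx_kl": approx_kl.item(),             # drift away from the reference model
        "resp_len_mean": response_lengths.float().mean().item(),  # creeping length can hint at reward hacking
    }

# Example with dummy values for a group of 4 sampled responses
metrics = compute_training_metrics(
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
    ref_logprobs=torch.tensor([-12.3, -15.1, -11.8, -14.0]),
    new_logprobs=torch.tensor([-11.9, -14.7, -11.5, -13.6]),
    response_lengths=torch.tensor([128, 256, 140, 300]),
)
print(metrics)
```

Tracking these numbers over time is often more informative than any single value; for example, a steadily rising approximate KL combined with growing response lengths is an early warning sign that the model is drifting toward exploiting the reward signal.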
7.1 Improving GRPO
After implementing GRPO (group relative policy optimization) in the previous chapter, we now revisit and analyze the training run more thoroughly. We also discuss a collection of practical tips and algorithmic choices that become important in real training runs. These topics are summarized in the chapter overview in figure 7.2.
Figure 7.2 An overview of the topics covered in this chapter.