14 Reasoning training and inference-time scaling


At the 2016 edition of the Neural Information Processing Systems conference (then NIPS, now NeurIPS), Yann LeCun introduced his now-famous cake metaphor for where learning happens in modern machine learning systems:

> If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).

With modern language models and recent changes to the post-training stack, this analogy is now largely complete. In its terms:

  • Self-supervised learning on vast swaths of internet data makes up the bulk of the cake (especially when measured by compute spent, in FLOPs),
  • Post-training begins with supervised finetuning (SFT) on instructions, which tunes the model to a narrower distribution (aided by the chosen completions used in RLHF), and
  • Finally "pure" reinforcement learning (RL) is the cherry on top.

With RL we learn only "a few bits" of information from a small number of training samples. This final stage came into its own with reasoning models, which combine the post-training techniques discussed in this book for aligning preferences with RL training on verifiable domains, dramatically increasing capabilities such as reasoning, coding, and mathematics problem solving.
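
For intuition, here is a minimal sketch of what "training on verifiable domains" can mean in the math setting: the model's sampled completion is checked programmatically against a known reference answer, and the RL trainer receives a binary reward. The `verifiable_reward` function and the `\boxed{...}` answer convention below are illustrative assumptions for this sketch, not a specific library's API.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final boxed answer matches the reference, else 0.0."""
    # Assumption: the model is prompted to place its final answer in \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer: no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Usage: score a sampled completion against the dataset's reference answer.
print(verifiable_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
```

Because the reward comes from a checker rather than a learned preference model, it is hard to game, which is part of why RL on these domains can be scaled so aggressively.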

14.1 The Origins of New Reasoning Models

14.1.1 Why Does RL Work Now?

14.1.2 RL Training vs. Inference Time Scaling

14.1.3 The Future (Beyond Reasoning) of Reinforcement Finetuning

14.2 Understanding Reasoning Training Methods

14.2.1 Reasoning Research Before OpenAI's o1 and DeepSeek R1

14.2.2 Early Reasoning Models

14.2.3 Common Practices in Training Reasoning Models