
7 Reasoning & Inference-Time Scaling

 

This chapter covers

  • How and why language models are trained to reason
  • How Reinforcement Learning with Verifiable Rewards (RLVR) builds on RLHF
  • Common implementation decisions for RLVR
  • Key reasoning models that established best practices

Reasoning models and inference-time scaling drove a major step change in language model performance at the end of 2024 and through 2025, and they will continue to shape the field going forward. Inference-time scaling is the property that performance improves as a model spends more compute on a problem at inference time, and language models trained to think extensively before answering exploit it especially well. These models are trained with large amounts of reinforcement learning with verifiable rewards (RLVR) [1], while still relying heavily on RLHF. In this chapter we trace the path that led the AI community to a renewed appreciation of RL's potential in language models, cover the fundamentals of RLVR, highlight key works, and point to the debates that will define the area over the next few years.
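
To make the contrast with RLHF concrete, below is a minimal sketch of what a verifiable reward can look like for a math-style task where each prompt ships with a known reference answer. The function names (`extract_final_answer`, `verifiable_reward`) and the answer-extraction heuristic are illustrative assumptions for this sketch, not part of any specific training stack.

```python
# Minimal sketch of a "verifiable reward" for RLVR on a math-style task.
# Assumes each prompt comes with a ground-truth answer string; names are illustrative.
import re


def extract_final_answer(completion: str) -> str:
    """Pull the last number in a completion and treat it as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else ""


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the reference exactly, else 0.0.

    Unlike a learned RLHF reward model, this check is a deterministic program,
    which is what makes the reward "verifiable."
    """
    return 1.0 if extract_final_answer(completion) == ground_truth.strip() else 0.0


if __name__ == "__main__":
    sample = "First compute 12 * 7 = 84, then subtract 4. The answer is 80."
    print(verifiable_reward(sample, "80"))  # prints 1.0
```

Because the check is a program rather than a learned preference model, it gives an unambiguous training signal on tasks where correctness can be checked automatically, which is the core idea that RLVR adds on top of RLHF.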

To start, at the 2016 edition of the Neural Information Processing Systems (NeurIPS) conference, Yann LeCun first introduced his now-famous cake metaphor for where learning happens in modern machine learning systems:

If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).

7.1 The Origins of New Reasoning Models

7.1.1 Why Does RL Work Now?

7.1.2 RL Training vs. Inference-Time Scaling

7.1.3 The Future (Beyond Reasoning) of RLVR

7.2 Understanding Reasoning Training Methods

7.2.1 Reasoning Research Before OpenAI’s o1 and DeepSeek R1

7.2.2 Early Reasoning Models

7.2.3 Common Practices in Training Reasoning Models

7.3 Looking Ahead