7 Reasoning & Inference-Time Scaling
This chapter covers
- How and why language models are trained to reason
- How Reinforcement Learning with Verifiable Rewards (RLVR) builds on RLHF
- Common implementation decisions for RLVR
- Key reasoning models that established best practices
Reasoning models and inference-time scaling enabled a major step change in language model performance beginning at the end of 2024 and continuing through 2025. Inference-time scaling – the ability to improve model performance by using more computation during generation, such as producing longer reasoning chains or sampling multiple responses – is the property that language models trained to think extensively before answering exploit so well. These models are trained with a large amount of reinforcement learning with verifiable rewards (RLVR) [1], while still relying on substantial RLHF. In this chapter we review the path that led the AI community to a transformed appreciation of RL's potential in language models, cover the fundamentals of RLVR, highlight key works, and point to the debates that will define the area over the next few years.
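To make "more computation during generation" concrete, consider the simplest form of inference-time scaling: best-of-N sampling, where we draw several candidate responses and keep the highest-scoring one. The sketch below is a minimal illustration, not any particular system's implementation; `generate` and `score` are hypothetical stand-ins for a call that samples one completion from a language model and a verifier or reward model that rates it.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one completion
    score: Callable[[str, str], float],  # hypothetical: verifier/reward score
    n: int = 8,
) -> str:
    """Sample n candidate completions and return the highest-scoring one.

    Increasing n spends more compute at inference time; with a reliable
    scorer, the selected answer tends to improve as n grows.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

The key design trade-off is that quality gains come entirely from the scorer: with a weak or exploitable scorer, larger n can select worse answers, which foreshadows why verifiable rewards matter so much in this chapter.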
To start, at the 2016 Neural Information Processing Systems (NeurIPS) conference, Yann LeCun introduced his now-famous cake metaphor for where learning happens in modern machine learning systems: