4 Improving reasoning with inference-time scaling
This chapter covers
- Prompting an LLM to explain its reasoning to improve answer accuracy
- Modifying the text generation function to produce diverse responses
- Improving reasoning reliability by sampling multiple responses
Reasoning performance and answer accuracy can be improved without retraining or otherwise modifying the model itself. Such methods operate at inference time, when the model generates text. As shown in the overview in figure 4.1, this chapter covers two inference-time scaling methods. As we will see later in the chapter, both methods more than double the accuracy of the base model used in previous chapters.
Figure 4.1 A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning without additional training (stage 3). In particular, it extends the text-generation function and implements a voting-based method to improve answer accuracy. The next chapter then introduces an inference-time scaling approach where the model iteratively refines its own answers.
The next section provides a general introduction to inference-time scaling before we discuss the methods shown in figure 4.1 in more detail.
4.1 Introduction to inference-time scaling
In general, there are two main strategies for improving reasoning:
- Increasing training compute
- Increasing inference compute (also known as inference-time scaling or test-time scaling)
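To make the second strategy concrete, the following minimal sketch shows one form of inference-time scaling that this chapter builds up to: sampling several answers from the model and taking a majority vote. Note that sample_answer here is only a toy stand-in for the book's actual text-generation function, and its candidate answers are made up for illustration.

import random
from collections import Counter

def sample_answer(prompt: str, temperature: float = 0.8) -> str:
    # Toy stand-in for the text-generation function (assumption for
    # illustration only). The temperature argument is unused here; the
    # real function would use it to control how diverse the samples are.
    candidates = ["42", "42", "42", "41", "24"]   # mock answer distribution
    return random.choice(candidates)

def majority_vote(prompt: str, num_samples: int = 5) -> str:
    # Inference-time scaling: spend extra compute by generating several
    # answers for the same prompt and returning the most frequent one.
    answers = [sample_answer(prompt) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?", num_samples=11))

The key point is that the extra accuracy is bought with additional generation calls at inference time, not with additional training.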