chapter fifteen

15 Test-time compute and small language models

 

This chapter covers

  • Defining test-time compute
  • Using the OptiLLM inference proxy for test-time compute with large and small language models
  • Embedding test-time compute with open source SLMs
  • Using GRPO to tune an SLM to reason in a specific domain

This chapter introduces test-time compute and surveys state-of-the-art SLMs and the libraries that support them. It also walks through a complete example on commodity hardware, showing how to apply Group Relative Policy Optimization (GRPO), the technique used to train the DeepSeek-R1 models, to specialize an SLM for a given domain.

15.1 Test-time compute

Test-time compute (TTC) is a new concept for LLMs that emerged in 2024. It refers to the computational resources a model uses at inference time to generate a response; spending more of them lets the model “think” longer and explore alternatives before answering. The idea is especially useful for difficult tasks where even large models can struggle, such as math, logic, and coding. With TTC, a model can dynamically increase its “reasoning” time during inference, using search methods such as beam search or tree-based exploration (like Monte Carlo tree search) to explore multiple candidate outputs within a compute budget. It can then score those candidates and pick the best one, or even execute generated code and check it for errors. The tradeoff is simple: more “thinking” time can improve accuracy, but it also increases compute costs.
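To make the tradeoff concrete, here is a minimal sketch of the simplest TTC strategy, often called best-of-N sampling: draw several candidates within a budget, score each one, and keep the best. Everything in it is a stand-in chosen for illustration. The hypothetical propose_root function plays the role of a model sampling one response (here, a random integer guess), and the hypothetical score_root function plays the role of a verifier that checks how well a guess solves x² − 5x + 6 = 0. A real system would replace them with model calls and a reward model or code executor.

```python
import random
from typing import Callable, Tuple


def propose_root(rng: random.Random) -> int:
    """Toy stand-in for one sampled model response: a guessed integer root."""
    return rng.randint(-10, 10)


def score_root(x: int) -> int:
    """Toy verifier for x^2 - 5x + 6 = 0. A score of 0 is perfect; lower is worse."""
    return -abs(x * x - 5 * x + 6)


def best_of_n(
    propose: Callable[[random.Random], int],
    score: Callable[[int], int],
    budget: int,
    seed: int = 0,
) -> Tuple[int, int]:
    """Sample `budget` candidates, score each one, and return (best, best_score)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    candidates = [propose(rng) for _ in range(budget)]
    best = max(candidates, key=score)
    return best, score(best)


if __name__ == "__main__":
    # A larger budget means more samples (more compute) and a better chance
    # that at least one candidate satisfies the verifier.
    for budget in (1, 4, 16, 64):
        answer, quality = best_of_n(propose_root, score_root, budget)
        print(f"budget={budget:>2}  answer={answer:>3}  score={quality}")
```

Because the seed is fixed, a larger budget evaluates a superset of the candidates seen by a smaller one, so the best score can only stay the same or improve as the budget grows, while the cost rises linearly with the number of samples.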

15.2 The OptiLLM inference proxy

15.3 SLMs with embedded test-time compute

15.4 Building a reasoning domain-specific SLM

Summary