15 Test-time compute and small language models
This chapter covers
- Defining test-time compute
- Using the OptiLLM inference proxy for test-time compute with large and small language models
- Embedding test-time compute with open source SLMs
- Tuning an SLM to reason through GRPO for a specific domain
This chapter introduces the concept of test-time compute and surveys state-of-the-art SLMs and libraries. It also provides a complete commodity-hardware example, showing how you can apply Group Relative Policy Optimization (GRPO), the technique used to train the DeepSeek-R1 models, to specialize an SLM for a given domain.
15.1 Test-time compute
Test-time compute (TTC) is a new concept for LLMs that emerged in 2024. It refers to the computational resources a model uses at inference time to generate responses. Spending more of those resources lets the model “think” further and explore alternatives before answering. The idea is especially useful for difficult tasks where even large models can struggle, such as math, logic, and coding. With TTC, a model can dynamically increase its “reasoning” time during inference, using search methods such as beam search or tree-based exploration (like Monte Carlo tree search) to explore multiple candidate outputs within a compute budget. It can then score those candidates and pick the best one, or even execute generated code and check for errors. The tradeoff is simple: more “thinking” time can improve accuracy, but it increases compute costs.
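To make the "generate several candidates, score them, pick the best" idea concrete, here is a minimal best-of-N sketch in Python. It is only an illustration under stated assumptions: `noisy_multiply` is a hypothetical stand-in for sampling from a language model, and `exact_verifier` is a hypothetical stand-in for a scorer such as a reward model or a code-execution check. Neither name comes from a real library, and real TTC systems may use beam search or tree search rather than plain sampling.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str, random.Random], int],
    score: Callable[[str, int], float],
    prompt: str,
    n: int,
    seed: int = 0,
) -> Tuple[int, float]:
    """Sample n candidates, score each one, and return the best with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # The compute budget is n: each extra candidate costs one more generate call.
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def noisy_multiply(prompt: str, rng: random.Random) -> int:
    """Hypothetical 'model': multiplies two integers but is sometimes off by a little."""
    a, b = (int(x) for x in prompt.split("*"))
    return a * b + rng.choice([-2, -1, 0, 0, 1, 2])


def exact_verifier(prompt: str, candidate: int) -> float:
    """Hypothetical scorer: higher is better, 0.0 means the candidate is exact."""
    a, b = (int(x) for x in prompt.split("*"))
    return float(-abs(a * b - candidate))


if __name__ == "__main__":
    for budget in (1, 4, 16):
        answer, quality = best_of_n(noisy_multiply, exact_verifier, "17 * 23", n=budget)
        print(f"n={budget:2d} answer={answer} score={quality}")
```

With a fixed seed, a larger `n` reuses the same first candidates and adds more, so the best score can only stay the same or improve, while the number of `generate` calls grows linearly with `n`. That is the accuracy-versus-cost tradeoff in miniature.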