8 Knowledge distillation: Making powerful models practical
This chapter covers
- Distilling knowledge from large models
- Using temperature-scaled soft targets
- Building DeepSeek-R1’s distilled models
In our last chapter, we equipped our DeepSeek model with the ability to reason through reinforcement learning. The model can now solve complex math problems, write code, and think step-by-step through difficult questions. But there is a catch, and it is a big one.
The DeepSeek-R1 model we built is a 671-billion-parameter Mixture-of-Experts behemoth. Running it requires a cluster of eight high-end GPUs and costs thousands of dollars per day in compute, and the model is far too large to deploy on the devices where reasoning is actually needed: laptops, phones, and edge servers. What if we could take everything this massive model has learned (its reasoning ability, its mathematical insight, its code-writing skill) and compress it into a model small enough to run on a single consumer GPU? Or even on a phone?
That is exactly what knowledge distillation does. And the results are nothing short of extraordinary: DeepSeek released a distilled model with just 1.5 billion parameters that outperforms GPT-4o on mathematical reasoning benchmarks. A model roughly 450 times smaller than its 671-billion-parameter teacher (671 / 1.5 ≈ 447), beating one of the most powerful AI systems in the world.
In this chapter, we will understand how this is possible. As illustrated in figure 8.1, our roadmap will cover: