chapter eight

8 Knowledge distillation: Making powerful models practical

This chapter covers

Distilling knowledge from large models
Using temperature-scaled soft targets
Building DeepSeek-R1’s distilled models

In the previous chapter, we used reinforcement learning to teach our DeepSeek model to reason. The model can now solve complex math problems, write code, and think step by step through difficult questions. But there is a catch, and it is a big one.

The DeepSeek-R1 model we built is a 671-billion-parameter Mixture-of-Experts behemoth. Running it requires a cluster of eight high-end GPUs, costs thousands of dollars per day in compute, and is far too large to deploy on the devices where reasoning is actually needed: laptops, phones, and edge servers. What if we could take everything this massive model has learned, all of its reasoning ability, its mathematical insight, its code-writing skill, and compress it into a model small enough to run on a single consumer GPU? What if we could compress it into a model small enough to run on a phone?

That is exactly what knowledge distillation does. And the results are nothing short of extraordinary: DeepSeek released a distilled model with just 1.5 billion parameters that outperforms GPT-4o on mathematical reasoning benchmarks such as AIME 2024 and MATH-500. That student is roughly 450 times smaller than its 671-billion-parameter teacher, yet on math it beats one of the most powerful AI systems in the world.
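To put those sizes in perspective, here is a rough back-of-the-envelope sketch. The parameter counts come from the text above. The bytes-per-parameter figures are the usual values for each numeric format, and the helper name `weight_memory_gb` is ours, introduced only for illustration. The estimate counts weights only, ignoring activations, the KV cache, and framework overhead, so real memory needs are higher.

```python
# Back-of-the-envelope sizing. Weights only: activations, KV cache,
# and framework overhead are deliberately ignored (an assumption).
TEACHER_PARAMS = 671e9  # DeepSeek-R1 total parameters (from the text)
STUDENT_PARAMS = 1.5e9  # smallest distilled model (from the text)

# Storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


size_ratio = TEACHER_PARAMS / STUDENT_PARAMS
print(f"teacher / student parameter ratio: {size_ratio:.0f}x")

for fmt, nbytes in BYTES_PER_PARAM.items():
    teacher_gb = weight_memory_gb(TEACHER_PARAMS, nbytes)
    student_gb = weight_memory_gb(STUDENT_PARAMS, nbytes)
    print(f"{fmt}: teacher {teacher_gb:,.0f} GB, student {student_gb:.2f} GB")
```

Under these assumptions, the ratio works out to about 447, which is where the "roughly 450 times smaller" figure comes from. At 16-bit precision the teacher needs about 1,342 GB for its weights alone, while the student needs about 3 GB, small enough for a single consumer GPU.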

In this chapter, we will understand how this is possible. As illustrated in figure 8.1, our roadmap will cover:

8.1 Why 671 billion parameters won’t fit in your pocket

8.1.1 The cost of intelligence

8.1.2 The dream: reasoning in your pocket

8.2 The teacher-student paradigm

8.2.1 Why soft labels carry more information than hard labels

8.2.2 The master chef analogy

8.3 Temperature and dark knowledge

8.3.1 The temperature-scaled softmax

8.3.2 A numerical walkthrough: seeing the dark knowledge emerge

8.3.3 What the dark knowledge reveals

8.3.4 Why T² scaling matters

8.3.5 The temperature Goldilocks zone

8.4 Building the distillation loss: From naive to powerful

8.4.1 Attempt #1: Training from scratch with hard labels

8.4.2 Attempt #2: Matching the teacher’s output at T=1

8.4.3 Temperature-scaled soft targets

8.4.4 The complete distillation loss

8.5 DeepSeek-R1’s distillation recipe

8.5.1 From logits to language: a paradigm shift

8.5.2 The 800K training dataset: Rejection sampling at scale

8.5.3 Classical KD versus DeepSeek’s approach

8.5.4 The six distilled models

8.5.5 The economics of distillation

8.6 Implementing knowledge distillation in PyTorch

8.6.1 Defining teacher and student models

8.6.2 The training loop