Chapter six

6 Knowledge recovery through distillation


This chapter covers

  • Deciding what to prune before training
  • Transferring knowledge to smaller models
  • Recovering lost capabilities efficiently
  • Applying proven recovery strategies

Structural pruning — both depth and width — removes knowledge along with parameters; the more you prune, the greater the loss. Recovering that knowledge is one of the most critical stages in the model rearchitecting pipeline: it determines whether the pruned model can match, or even surpass, the base model's performance.

That recovery depends on two things: which parts of the model you removed, and how you train to recover what was lost. The first lesson I learned about this came before any training loop.

I remember the first day I compared two recovery experiments. Both started from models with the same number of Transformer blocks removed, but in one I had removed the blocks by importance and in the other I had removed the last blocks. The difference was striking: the model with importance-based pruning recovered its performance in a fraction of the training time. That moment changed how I saw optimization. Knowledge recovery doesn't start when we apply Knowledge Distillation; it starts when we decide which parts of the model to keep.
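The chapter doesn't give the importance metric yet, but one widely used heuristic for depth pruning scores each Transformer block by how much it changes its input: a block whose output is nearly identical to its input (high cosine similarity) contributes little and is a cheap pruning candidate. The sketch below illustrates the idea with plain Python lists standing in for activation vectors; `block_importance` and `blocks_to_prune` are hypothetical helper names, not functions from the book's codebase.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two activation vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def block_importance(block_inputs, block_outputs):
    # A block that barely transforms its input is less important:
    # importance = 1 - cos(input, output) for each block.
    return [1.0 - cosine_similarity(i, o)
            for i, o in zip(block_inputs, block_outputs)]

def blocks_to_prune(importance, n):
    # Importance-based pruning: drop the n least-important blocks,
    # wherever they sit in the stack (vs. naively dropping the last n).
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return sorted(ranked[:n])

# Toy example: blocks 1 and 3 change their inputs the least.
scores = [0.90, 0.10, 0.50, 0.05]
print(blocks_to_prune(scores, 2))  # → [1, 3]
```

With this selection, "remove the last blocks" and "remove by importance" only coincide when the final blocks really are the least important, which in practice they often are not — hence the gap in recovery time described above.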

In Chapter 2, you saw the fundamentals of Knowledge Distillation: how a teacher model transfers its knowledge to a smaller student using only the model outputs (soft labels).
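As a refresher, soft-label distillation can be written as a KL divergence between the teacher's and student's temperature-softened output distributions. The sketch below uses the standard formulation (temperature scaling plus the T² gradient-rescaling factor from Hinton et al.); it is a minimal pure-Python illustration, not the chapter's exact loss implementation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher T spreads probability mass
    # across more classes, exposing the teacher's "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# A student that matches the teacher exactly incurs zero loss.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
```

Later sections build on this: the compound loss of section 6.3.1 combines a term like `kd_loss` with a standard task loss, and the Skew KLD of section 6.4.1 modifies the divergence itself.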

6.1 Choosing what to prune

6.1.1 Experiment configuration

6.1.2 Preparing the recovery dataset

6.1.3 Selecting the best candidate

6.2 Aligning teacher and student

6.3 Recovering the knowledge

6.3.1 The compound loss function

6.3.2 Training loop and strategy comparison

6.4 Advanced recovery techniques (FDD / Skew KLD)

6.4.1 Skew KLD: correcting distribution bias

6.4.2 Feature Dynamics Distillation

6.4.3 Results: when advanced techniques make a difference

6.5 Practical guidelines for knowledge recovery

6.5.1 Composite loss configuration

6.5.2 When to use advanced techniques

6.5.3 Combining with width pruning

6.6 From paper to practice

6.6.1 Layer selection and alignment

6.6.2 Advanced techniques: when data is scarce

6.7 Hands-on lab

6.8 Summary