6 Knowledge recovery through distillation
This chapter covers
- Deciding what to prune before training
- Transferring knowledge to smaller models
- Recovering lost capabilities efficiently
- Applying proven recovery strategies
Structural pruning, whether in depth or width, comes at the cost of model knowledge; the more you prune, the greater the loss. Recovering that knowledge is one of the most critical stages in the model rearchitecting pipeline: it determines whether the pruned model can match, or even surpass, the performance of the base model.
How well that recovery goes depends on two things: which parts you removed, and how you train to restore the lost knowledge. The first lesson I learned about this came before any training loop.
I remember the first day I compared two recovery experiments. Both started from models with the same number of Transformer blocks removed, but in one I had removed the blocks by importance and in the other I had removed the last blocks. The difference was striking: the model with importance-based pruning recovered its performance in a fraction of the training time. That moment changed how I saw optimization. Knowledge recovery doesn't start when we apply Knowledge Distillation; it starts when we decide which parts of the model to keep.
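To make "removing blocks by importance" concrete, here is a minimal sketch of one way to score Transformer blocks before pruning. The scoring rule (one minus the cosine similarity between a block's input and output hidden states on calibration data) and the helper names are illustrative assumptions, not the exact procedure used in the experiments above.

```python
import torch

def block_importance(blocks, hidden_states):
    """Score each block by how much it changes its input on calibration data.

    A block whose output stays close to its input (high cosine similarity)
    contributes little and is a candidate for removal.
    """
    scores = []
    h = hidden_states
    with torch.no_grad():
        for block in blocks:
            out = block(h)
            cos = torch.nn.functional.cosine_similarity(
                h.flatten(1), out.flatten(1), dim=-1
            ).mean()
            scores.append(1.0 - cos.item())  # low score: block barely alters its input
            h = out
    return scores

def blocks_to_prune(blocks, hidden_states, n_remove):
    """Return the indices of the n_remove least important blocks,
    rather than naively dropping the last n_remove blocks."""
    scores = block_importance(blocks, hidden_states)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n_remove]

# Toy example: linear layers stand in for Transformer blocks,
# and random activations stand in for calibration data.
blocks = [torch.nn.Linear(64, 64) for _ in range(8)]
calibration = torch.randn(16, 64)
print(blocks_to_prune(blocks, calibration, n_remove=2))
```

The point of the sketch is only the selection step: the blocks you keep determine how much knowledge is left to recover later.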
In Chapter 2, you saw the fundamentals of Knowledge Distillation: how a teacher model transfers its knowledge to a smaller student using only the model outputs (soft labels).
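As a quick refresher, the sketch below shows the core of that transfer: a temperature-softened KL divergence between teacher and student logits. The function name and the temperature value are illustrative assumptions, not a fixed recipe from Chapter 2.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened distributions.

    The teacher's soft labels (full probability distributions) carry far more
    signal per example than hard labels, which is what the student learns from.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Toy usage: random logits stand in for teacher and student outputs.
teacher_logits = torch.randn(4, 32000)  # batch of 4 over a 32k-token vocabulary
student_logits = torch.randn(4, 32000)
loss = soft_label_distillation_loss(student_logits, teacher_logits)
```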