
2 An end-to-end rearchitecting project


This chapter covers

  • Establishing baseline capabilities and inference metrics
  • Applying depth pruning to remove layers
  • Measuring structural modification impacts
  • Recovering knowledge through distillation
  • Creating smaller, faster models

In the previous chapter, we introduced the model-tailoring pipeline and explained that the key differentiator in our approach to creating efficient solutions is rearchitecting the models.

In this chapter, you’ll get hands-on by adapting a model. We’ll treat this as a professional assignment: we have a base model that works very well, and we need to reduce its size while preserving as many of its capabilities as possible.

NOTE

This is very common in the industry. For example, NVIDIA combines structural optimization (through pruning techniques) and knowledge recovery (using knowledge distillation) to create its model families. That way, it only needs to fully train the largest model in the family.
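The knowledge-distillation step mentioned above typically minimizes a divergence between the teacher's and student's output distributions. Here is a minimal sketch of a Hinton-style soft-target distillation loss in PyTorch; the function name, temperature value, and toy logits are illustrative, not part of NVIDIA's actual pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KD loss: KL divergence between temperature-scaled
    teacher and student output distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * t * t

# Sanity check with toy logits: a student that matches the teacher
# exactly incurs (near-)zero loss.
teacher_logits = torch.randn(4, 10)
loss_self = distillation_loss(teacher_logits, teacher_logits)
```

The temperature softens both distributions so the student also learns from the teacher's relative rankings of unlikely tokens, not just its top prediction.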

To create the smaller model, we'll run two phases of the model-tailoring pipeline that we introduced in the previous chapter (see figure 1.2): Structural Optimization, where we'll modify the model's architecture to make it lighter and faster, and Knowledge Recovery, where we transfer knowledge from the base model to the new pruned model. Our focus here is general-purpose optimization: we're not tailoring the model to a specific domain or task, which is the third phase of the pipeline and the subject of later chapters.
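The structural-optimization phase we'll apply is depth pruning: deleting whole layers from the model's stack. The essence can be sketched on a toy PyTorch model; the class, function, and layer counts below are illustrative stand-ins for a real transformer's layer list (e.g. a Hugging Face model's `ModuleList` of decoder blocks), not code from this book's pipeline:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for a base model: a stack of identical blocks."""
    def __init__(self, num_layers=8, dim=16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim)
                                    for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

def depth_prune(model, layers_to_drop):
    """Remove the given layer indices, shrinking the model's depth
    (and hence its parameter count and per-token latency)."""
    drop = set(layers_to_drop)
    kept = [layer for i, layer in enumerate(model.layers) if i not in drop]
    model.layers = nn.ModuleList(kept)
    return model

base = ToyModel(num_layers=8)
pruned = depth_prune(base, layers_to_drop=[3, 4, 5])  # drop three middle blocks
out = pruned(torch.randn(2, 16))  # forward pass still works with 5 layers
```

Because the remaining layers were trained to cooperate with the deleted ones, the pruned model's quality drops, which is exactly why the Knowledge Recovery phase follows.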

2.1 The rearchitecting workflow

2.2 Establishing the baseline

2.3 Applying depth pruning

2.4 Evaluating the impact of pruning

2.5 Recovering knowledge

2.6 Analyzing the final result

2.7 Hands-on lab

2.8 Summary