
5 Shaping model architectures via width pruning


This chapter covers

  • Understanding width pruning trade-offs
  • Creating specialized task models
  • Building models using your own data
  • Measuring speed, energy, and reasoning

The most direct structural pruning technique is depth pruning, which is especially useful when inference speed is your main concern. It's an aggressive technique: it removes entire Transformer blocks at once.
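To make the idea concrete, here is a minimal sketch of depth pruning on a toy residual stack. The "blocks" are simple residual matrix updates standing in for full Transformer blocks, and the choice of which blocks to drop is hard-coded for illustration; in practice you would score blocks first (for example, by how little they change their input) before removing them.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 8, 6  # toy hidden size and depth

# Each "block" is a toy residual update: x -> x + W @ x
# (a stand-in for a full Transformer block with attention and MLP).
blocks = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

def forward(x, blocks):
    for W in blocks:
        x = x + W @ x  # residual connection, as in a Transformer block
    return x

# Depth pruning: drop whole blocks from the stack
# (here, the two middle ones, chosen arbitrarily for the sketch).
pruned = blocks[:2] + blocks[4:]

x = rng.normal(size=d)
print(len(pruned), forward(x, pruned).shape)  # 4 (8,)
```

The key property shown here is that removing whole blocks leaves every remaining weight matrix, and the model's input/output shapes, untouched, which is why depth pruning is so simple to apply.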

A more refined technique is to target specific neurons in a Multi-Layer Perceptron (MLP) module with a Gated Linear Unit (GLU) structure, the gating mechanism we explored in chapter 3. It's a technique that can be applied with precision, not just to improve performance but, as you'll see, to change the model's personality. The ideal candidates for this type of optimization are wide models. In this chapter, we'll use the Llama-3.2-1B model, whose MLP layers have a 4× expansion factor.
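The mechanics can be sketched in a few lines of NumPy. This toy GLU MLP uses made-up dimensions and a simple magnitude-based importance score (the L2 norm of each neuron's gate and up weights) purely for illustration; the chapter develops more careful, data-driven scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32  # toy dims; Llama-3.2-1B expands 2048 -> 8192 (4x)

# GLU-style MLP: y = down @ (silu(gate @ x) * (up @ x))
gate = rng.normal(size=(d_mlp, d_model))
up   = rng.normal(size=(d_mlp, d_model))
down = rng.normal(size=(d_model, d_mlp))

def silu(z):
    return z / (1.0 + np.exp(-z))

def mlp(x, gate, up, down):
    return down @ (silu(gate @ x) * (up @ x))

# Toy static importance score: weight magnitude of each hidden neuron.
importance = np.linalg.norm(gate, axis=1) + np.linalg.norm(up, axis=1)
keep = np.sort(np.argsort(importance)[-d_mlp // 2:])  # keep top 50%

# Width pruning: slice out neuron rows of gate/up and the matching
# columns of down, yielding a narrower but structurally identical MLP.
gate_p, up_p, down_p = gate[keep], up[keep], down[:, keep]

x = rng.normal(size=d_model)
print(mlp(x, gate_p, up_p, down_p).shape)  # (8,) -- same output dim
```

Because each hidden neuron corresponds to one row of the gate and up projections and one column of the down projection, removing a neuron means slicing all three in lockstep; the module's external interface never changes.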

We'll focus mostly on the performance advantages this technique can bring, but its uses go beyond that. To measure its overall impact on performance, capabilities, and energy consumption, we'll introduce a more complete benchmarking system (its technical details are in the appendices) that we'll use throughout the rest of the book.

We'll approach this technique from two complementary angles:

5.1 Selecting static neurons

5.1.1 Identifying neurons for removal

5.1.2 Reconstructing the MLP module

5.1.3 Measuring the impact: benchmarks and trade-off

5.1.4 Data-driven neuron selection

5.1.5 Extracting the dynamic importance component

5.1.6 Reconstructing the MLP with hybrid scores

5.1.7 Running the calibration loop

5.1.8 Creating domain-specialized models

5.1.9 Measuring the specialization effect

5.1.10 Inference performance and energy efficiency

5.2 From paper to practice

5.3 Hands-on lab

5.4 Summary