
4 Building smaller and faster LLMs with depth pruning

 

This chapter covers

  • Fundamentals and benefits of depth pruning
  • Static vs. data-driven layer selection strategies
  • Measuring layer contribution with cosine similarity
  • Hands-on analysis using PyTorch hooks
  • Research insights from the "Shortened LLaMA" paper

In my first re-architectures, I found myself wondering which transformer blocks to remove. Following common practice, I used the heuristic of always leaving the first and last blocks untouched, which worked especially well in general-purpose models. But when building a model for a specific task, I had doubts about whether I was removing the right blocks.

So that you are always confident in your choices, in this chapter we continue the work started in Chapters 2 and 3. We'll look not only at methods based on the model's static structure but also adopt a data-driven approach. You'll learn to measure each block's contribution based on the data the resulting model will work with. To do this, we'll need to "spy on" the model's internal activations.

Our approach will be eminently practical. We'll apply the same model to two very different datasets: a general text corpus (WikiText) and an SMS classification dataset. Comparing the results will show that the importance of the blocks that make up the model is neither absolute nor static; it depends directly on the data and the task the model must perform.
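To preview the core mechanic before we get to real LLMs, here is a minimal sketch of the hook-plus-cosine-similarity idea. The `ToyBlock` stack is an illustrative stand-in of my own invention, not the chapter's actual model: each block adds a small residual update, mimicking a transformer block, and a forward hook measures how much each block actually changes the hidden state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a stack of transformer blocks (the chapter uses a
# real LLM; this tiny model only illustrates the hook mechanics).
class ToyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        # Residual update, like a transformer block's output
        return x + torch.tanh(self.linear(x))

dim, n_blocks = 16, 4
model = nn.Sequential(*[ToyBlock(dim) for _ in range(n_blocks)])

# One forward hook per block records the cosine similarity between the
# block's input and output: values near 1.0 mean the block barely
# changes the hidden state, marking it as a candidate for removal.
similarities = {}

def make_hook(idx):
    def hook(module, inputs, output):
        sim = F.cosine_similarity(inputs[0], output, dim=-1)
        similarities[idx] = sim.mean().item()
    return hook

handles = [blk.register_forward_hook(make_hook(i))
           for i, blk in enumerate(model)]

with torch.no_grad():
    # A random batch stands in for samples from WikiText or the SMS dataset
    model(torch.randn(8, dim))

for h in handles:
    h.remove()  # always detach hooks when you're done observing

for i, s in sorted(similarities.items()):
    print(f"block {i}: input/output cosine similarity {s:.3f}")
```

Running the same loop over batches from two different datasets is exactly how we'll see, later in the chapter, that block importance shifts with the task.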

4.1 Fundamentals and benefits of depth pruning

4.1.1 How depth pruning works

4.1.2 The benefits: memory consumption and speed

4.2 Static block selection

4.2.1 Removing first, last, middle blocks

4.2.2 Removing by weight

4.2.3 Static vs. weight-based selection

4.3 Data-Driven block selection

4.3.1 Using PyTorch hooks

4.3.2 Understanding cosine similarity

4.3.3 Analyzing block contributions across different datasets

4.3.4 Choosing the blocks to discard

4.3.5 Analysis of the benchmarks

4.4 From paper to practice

4.5 Hands-on lab

4.6 Summary