chapter three
3 A blueprint to modern transformers
This chapter covers
- Classical architecture with DistilGPT
- Modern optimizations: GQA and GLU
- Mapping optimizations to model components
In the previous chapter, you completed your first re-architecture project: you removed layers from a Transformer model and then recovered its capabilities by distilling knowledge from the base model. The result was a completely new model based on a Gemma-3 model.
In this chapter, we'll study the anatomy of modern Transformers, from the classic architecture to the most advanced ones. We'll identify the specific components that you'll optimize in the upcoming chapters, which will give you the foundation to decide which re-architecture technique to use in each project. These concepts are the cornerstone for understanding the optimization techniques you'll apply in the following chapters.