chapter three

3 A blueprint for modern transformers


This chapter covers

  • Classical architecture with DistilGPT2
  • Modern optimizations: GQA and GLU
  • Mapping optimizations to model components

In the previous chapter, you completed your first re-architecture project: you removed layers from a Transformer model and then recovered its capabilities through knowledge distillation from the base model. The result was a completely new model derived from Gemma-3.
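As a quick reminder of how that recovery worked, here is a minimal sketch of a temperature-scaled knowledge-distillation loss of the kind used in that project. The function name, temperature value, and variable names are illustrative assumptions, not the book's lab code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Illustrative KD objective: pull the student's predictions
    toward the teacher's softened distribution."""
    # Soften both distributions with temperature T.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between teacher and student, scaled by T^2
    # so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T * T)
```

Minimizing this loss drives the smaller student model to reproduce the base (teacher) model's output distribution, which is how the pruned model regained its capabilities.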

In this chapter, we'll study the anatomy of modern Transformers, from the classic architecture to the most advanced ones. We'll identify the specific components you'll optimize in the upcoming chapters, giving you the foundation to decide which re-architecture technique to use in each project. The concepts in this chapter are the cornerstone for understanding the optimization techniques you'll apply in the chapters that follow.

3.1 Classical architecture: DistilGPT2

3.1.1 General behavior of a Transformer model

3.1.2 The classical attention mechanism

3.1.3 The classical MLP mechanism

3.1.4 The Transformer dimensions: depth and width

3.2 The modern Transformer architecture

3.2.1 Optimized attention: from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA)

3.2.2 The evolution of the MLP: from simple expansion to Gated Linear Units (GLU)

3.3 Connecting structure, behavior, and optimization

3.4 Hands-on lab

3.5 Summary