2 A deeper look into transformers


This chapter covers

  • Sequence modeling before transformers
  • Core components of a transformer model
  • Attention mechanism and its variants
  • How transformers can help stabilize gradient propagation

If you’ve interacted with transformer-based tools like ChatGPT, you've experienced firsthand how effectively LLMs can interpret and generate natural language. But truly succeeding when you apply these models to your own tasks takes more than importing a pre-built pipeline. Whether you're fine-tuning an LLM, troubleshooting unexpected performance issues, optimizing GPU resources, or exploring advanced architectures such as mixture-of-experts (MoE) or parameter-efficient techniques like LoRA, you'll need a solid understanding of the transformer’s inner workings.

In this chapter, we'll demystify the seemingly complex transformer architecture by breaking it down into foundational concepts such as self-attention, multi-head attention, feed-forward networks, and positional encoding. Understanding these core components will empower you not only to use existing language models confidently but also to adapt and optimize them effectively for your real-world production scenarios.
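As a small preview of where we're headed, the sketch below shows the scaled dot-product attention computation that sits at the heart of self-attention (covered in detail in section 2.2.3). The function name, the use of NumPy, and the toy shapes are illustrative assumptions for this preview, not the exact code we'll build later in the chapter.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]                                # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                  # how strongly each query matches every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

# Toy example: 3 tokens with 4-dimensional embeddings (illustrative numbers only)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4): one output vector per token
```

Every term in this short function, from the query/key/value projections to the scaling by the square root of the key dimension, is unpacked step by step in section 2.2.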

2.1 From seq2seq models to transformers

2.1.1 The difficulty of training RNNs

2.1.2 Introducing attention mechanisms

2.1.3 Vanishing gradients: transformers to the rescue

2.1.4 Exploding gradients: when large gradients disrupt training

2.2 Model architecture

2.2.1 Encoder and decoder stacks

2.2.2 Positional encoding

2.2.3 Attention

2.2.4 Position-wise feed-forward networks

2.3 Summary