2 A deeper look into transformers


This chapter covers

  • Sequence modeling before transformers
  • Core components of a transformer model
  • Attention mechanism and its variants
  • How transformers can help stabilize gradient propagation

If you’ve interacted with transformer-based tools like ChatGPT, you've experienced firsthand how effectively LLMs can interpret and generate natural language. But truly succeeding when you apply these models to your own tasks takes more than importing a pre-built pipeline. Whether you're fine-tuning an LLM, troubleshooting unexpected performance issues, optimizing GPU resources, or exploring advanced architectures such as mixture-of-experts (MoE) or parameter-efficient techniques like LoRA, you'll need a solid understanding of the transformer’s inner workings.

In this chapter, we'll demystify the seemingly complex transformer architecture by breaking it down into foundational concepts such as self-attention, multi-head attention, feed-forward networks, and positional encoding. Understanding these core components will empower you not only to use existing language models confidently but also to adapt and optimize them effectively for your real-world production scenarios.
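As a small preview of where we're headed, the sketch below shows the scaled dot-product attention computation that sits at the heart of self-attention (covered in detail in section 2.2.3). The function name, the use of NumPy, and the toy shapes are illustrative assumptions for this preview, not the exact code we'll build later in the chapter.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]                                # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                  # how strongly each query matches every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

# Toy example: 3 tokens with 4-dimensional embeddings (illustrative numbers only)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4): one output vector per token
```

Every term in this short function, from the query/key/value projections to the scaling by the square root of the key dimension, is unpacked step by step in section 2.2.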

2.1 From seq2seq models to transformers

2.1.1 The difficulty of training RNNs

2.1.2 Introducing attention mechanisms

2.1.3 Vanishing gradients: transformers to the rescue

2.1.4 Exploding gradients: when large gradients disrupt training

2.2 Model architecture

2.2.1 Encoder and decoder stacks

2.2.2 Positional encoding

2.2.3 Attention

2.2.4 Position-wise feed-forward networks

2.3 Summary