Part 1 Understanding attention
and transformers
This part introduces the foundations of transformer architectures, which have become the backbone of modern generative AI. We begin by comparing transformers and diffusion models, showing how each tackles the problem of generating data in a fundamentally different way: transformers build outputs one token at a time, while diffusion models refine random noise step by step. From there, you’ll build a transformer from scratch in chapter 2 to translate German into English, gaining hands-on experience with the attention mechanism that enables these models to capture relationships across sequences.
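As a small preview of what chapter 2 covers in depth, the core of the attention mechanism (scaled dot-product attention) can be sketched in a few lines of NumPy. The function and variable names here are illustrative, not the book's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j]: how strongly query position i attends to key position j,
    # scaled by sqrt(d) so the softmax doesn't saturate for large dimensions
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 sequence positions, embedding dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per position
```

Each output position is a weighted average over all value vectors, which is exactly how attention lets the model relate any two positions in a sequence, regardless of their distance.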
We then explore practical applications of transformers in computer vision and multimodal tasks. You’ll implement a vision transformer (ViT) to classify images in chapter 3 and build a multimodal transformer to generate captions for images in chapter 4, bridging the gap between visual and textual data. By the end of this part, you’ll understand how transformers adapt naturally from text to images, as well as why attention has become the most influential concept in modern AI.
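The key idea behind the vision transformer in chapter 3 is that an image can be turned into a sequence of "tokens" by cutting it into fixed-size patches and flattening each one, after which the same attention machinery applies. A minimal sketch of that patching step, with illustrative names and shapes not taken from the book's code:

```python
import numpy as np

def image_to_patches(img, patch):
    # img: (H, W, C) array; split into non-overlapping patch x patch tiles,
    # then flatten each tile into one vector (a ViT "token")
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly"
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return tiles.reshape(-1, patch * patch * C)  # (num_patches, patch_dim)

# A 32x32 RGB image cut into 8x8 patches yields 16 tokens of dimension 192
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens = image_to_patches(img, patch=8)
print(tokens.shape)  # (16, 192)
```

A real ViT then projects each flattened patch into the model's embedding space and adds positional information, but this reshaping is the conceptual bridge from pixels to the sequences transformers expect.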