2 Build a transformer
This chapter covers
- How the attention mechanism uses query, key, and value to assign weights to elements in a sequence
- Building an encoder-decoder transformer from scratch for language translation
- Word embedding and positional encoding
- Training a transformer from scratch to translate German to English
Understanding attention and the transformer architecture is foundational for modern generative AI, especially for text-to-image models. There are two main reasons this chapter comes so early in our journey to build a text-to-image generator from scratch.
First, one of the most powerful approaches to text-to-image generation is based directly on transformers. As you’ll see in chapter 12, models like OpenAI’s DALL-E treat image generation as a sequence prediction task. An image is divided into patches (such as a 16 × 16 grid, resulting in 256 patches). The transformer then generates these patches one by one, predicting the next patch in the sequence based on the text prompt and the patches generated so far. This sequential prediction, rooted in the same mechanisms used in language translation, demonstrates why a deep understanding of attention and transformers is crucial. As you’ll discover in this chapter, the attention mechanism is the “secret sauce” that enables transformers to model complex relationships in sequences.
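To make the patch-by-patch idea concrete, here is a minimal, framework-free sketch of that autoregressive loop. The `fake_model` stand-in, the token values, and the function names are illustrative assumptions rather than DALL-E's actual interface; the point is only the shape of the loop, in which each new patch is predicted from the text prompt plus everything generated so far.

```python
import random

def fake_model(tokens, vocab_size=256):
    # Stand-in for a trained transformer: returns a random "most likely next token."
    # In a DALL-E-style model, each token would index one image patch in a codebook.
    return random.randrange(vocab_size)

def generate(model, prompt_tokens, num_patches):
    tokens = list(prompt_tokens)          # start from the encoded text prompt
    for _ in range(num_patches):
        next_token = model(tokens)        # predict the next patch from everything so far
        tokens.append(next_token)         # feed it back in for the following step
    return tokens[len(prompt_tokens):]    # keep only the generated image patches

patches = generate(fake_model, prompt_tokens=[7, 42, 99], num_patches=256)
print(len(patches))  # 256 patches, e.g., a 16 x 16 grid
```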
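As a preview of that "secret sauce," the sketch below shows scaled dot-product attention in isolation, assuming PyTorch: each query is scored against every key, a softmax turns the scores into weights, and the output is a weighted average of the values. The chapter builds the full encoder-decoder transformer on top of this idea.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_model)
    d_k = query.size(-1)
    # Score each query against every key; scale by sqrt(d_k) to keep the softmax well-behaved
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1 across the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors
    return torch.matmul(weights, value), weights

# Toy self-attention: one sequence of 4 elements, each an 8-dimensional vector
x = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape)  # torch.Size([1, 4, 4]): one weight per (query, key) pair
```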