2 Build a transformer
This chapter covers
- How the attention mechanism uses query, key, and value to assign weights to elements in a sequence
- Building an encoder-decoder transformer from scratch for language translation
- Word embedding and positional encoding
- Training a transformer from scratch to translate German to English
Understanding attention and the transformer architecture is foundational for modern generative AI, especially for text-to-image models. There are two main reasons this chapter comes so early in our journey to build a text-to-image generator from scratch.
First, one of the most powerful approaches to text-to-image generation is based directly on transformers. As you’ll see in chapter 12, models like OpenAI’s DALL-E treat image generation as a sequence prediction task. An image is divided into patches (such as a 16 × 16 grid, resulting in 256 patches). The transformer then generates these patches one by one, predicting the next patch in the sequence based on the text prompt and the patches generated so far. This sequential prediction, rooted in the same mechanisms used in language translation, demonstrates why a deep understanding of attention and transformers is crucial. As you’ll discover in this chapter, the attention mechanism is the “secret sauce” that enables transformers to model complex relationships in sequences.
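To make the patch-by-patch idea concrete, here is a minimal, framework-free sketch of that autoregressive loop. The `fake_model` stand-in, the token values, and the function names are illustrative assumptions rather than DALL-E's actual interface; the point is only the shape of the loop, in which each new patch is predicted from the text prompt plus everything generated so far.

```python
import random

def fake_model(tokens, vocab_size=256):
    # Stand-in for a trained transformer: returns a random "most likely next token."
    # In a DALL-E-style model, each token would index one image patch in a codebook.
    return random.randrange(vocab_size)

def generate(model, prompt_tokens, num_patches):
    tokens = list(prompt_tokens)          # start from the encoded text prompt
    for _ in range(num_patches):
        next_token = model(tokens)        # predict the next patch from everything so far
        tokens.append(next_token)         # feed it back in for the following step
    return tokens[len(prompt_tokens):]    # keep only the generated image patches

patches = generate(fake_model, prompt_tokens=[7, 42, 99], num_patches=256)
print(len(patches))  # 256 patches, e.g., a 16 x 16 grid
```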
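As a preview of that "secret sauce," the sketch below shows scaled dot-product attention in isolation, assuming PyTorch: each query is scored against every key, a softmax turns the scores into weights, and the output is a weighted average of the values. The chapter builds the full encoder-decoder transformer on top of this idea.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_model)
    d_k = query.size(-1)
    # Score each query against every key; scale by sqrt(d_k) to keep the softmax well-behaved
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1 across the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors
    return torch.matmul(weights, value), weights

# Toy self-attention: one sequence of 4 elements, each an 8-dimensional vector
x = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape)  # torch.Size([1, 4, 4]): one weight per (query, key) pair
```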