9 How transformers work

This chapter covers

  • An explanation of the text generation problem
  • An introduction to unsupervised learning
  • Learning structure using an attention mechanism
  • Building up from simple probabilistic models to deep learning models
  • The transformer architecture and its variants and applications

While earlier chapters showcased deep learning’s capabilities in regression and classification, the true transformative power of this technology extends far beyond analyzing existing data. Deep learning is now venturing into creative territory: generating entirely new images, composing original text, and even producing realistic videos. These generative capabilities, once considered the exclusive purview of human intelligence, have become central to the current AI revolution, fueling much of the AI boom and enthusiasm we’ve witnessed in recent years.

9.1 A motivating example: Generating names character by character

9.2 Self-supervised learning

9.2.1 Limits of the bigram model

9.3 Generating our training data

9.4 Embeddings and linear layers

9.4.1 Visualizing embeddings

9.5 Attention

9.5.1 Dot product self-attention

9.5.2 Scaled dot product causal self-attention

9.6 Transformers

9.6.1 The decoder

9.7 Other transformer architectures

9.7.1 The encoder

9.7.2 The encoder-decoder

9.8 Tokenization

9.8.1 Generating sentences

9.9 The Vision Transformer

9.10 Conclusion

9.11 Exercises

Summary