9 How transformers work

This chapter covers

  • An explanation of the text generation problem
  • An introduction to unsupervised learning
  • Learning structure using an attention mechanism
  • Building up from simple probabilistic models to deep learning models
  • The transformer architecture and its variants and applications

While earlier chapters showcased deep learning’s capabilities in regression and classification, the true transformative power of this technology extends far beyond analyzing existing data. Deep learning is now venturing into creative territory: generating entirely new images, composing original text, and even producing realistic videos. These generative capabilities, once considered the exclusive purview of human intelligence, have become central to the current AI revolution, fueling much of the AI boom and enthusiasm we’ve witnessed in recent years.

9.1 A motivating example: Generating names character by character

9.2 Self-supervised learning

9.2.1 Limits of the bigram model

9.3 Generating our training data

9.4 Embeddings and linear layers

9.4.1 Visualizing embeddings

9.5 Attention

9.5.1 Dot product self-attention

9.5.2 Scaled dot product causal self-attention

9.6 Transformers

9.6.1 The decoder

9.7 Other transformer architectures

9.7.1 The encoder

9.7.2 The encoder-decoder

9.8 Tokenization

9.8.1 Generating sentences

9.9 The Vision Transformer

9.10 Conclusion

9.11 Exercises

Summary