5 Attention is all you need
This chapter covers
- How the Transformer replaced recurrence and convolution with self-attention.
- Core mechanics: Q/K/V, multi-head attention, and positional encodings.
- Seq2seq, “thought vectors,” the reversal trick, and Order Matters.
- Cultural impact of the Transformer and the “X is All You Need” meme.
- The Universal Transformer and OpenAI’s scaling path.
In 2017, researchers from Google replaced recurrence and convolutions with a new architecture based on attention, which they named the Transformer.[1] The authors stated: “We show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks.”[2] Beneath this measured academic tone was the start of a revolution.
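The mechanics behind that claim are the subject of the rest of this chapter; as a quick preview, the sketch below shows the core operation, scaled dot-product self-attention, in plain NumPy. It is a minimal illustration, not the paper's implementation: the projection matrices W_q, W_k, W_v and the toy shapes are placeholders chosen for readability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score each query against every key, softmax the scores, and mix the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V                                         # weighted sum of values

# Toy example: one sequence of 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))
# In self-attention, Q, K, and V are all linear projections of the same input.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # illustrative weights
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (1, 4, 8): each token's output is a mixture of all tokens' values
```

Every token attends to every other token in a single step, with no recurrence over positions and no fixed convolutional window; that is the property the rest of the chapter builds on.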