chapter five

5 Attention Is All You Need

This chapter covers

How the Transformer replaced recurrence and convolution with self-attention.
Core mechanics: Q/K/V, multi-head attention, and positional encodings.
Seq2seq, “thought vectors,” the reversal trick, and Order Matters.
Cultural impact of the Transformer and the “X is All You Need” meme.
The Universal Transformer and OpenAI’s scaling path.

In 2017, researchers at Google replaced recurrence and convolution with a new attention-based architecture, which they named the Transformer.[1] The authors stated: “We show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks.”[2] Beneath this measured academic tone lay the start of a revolution.

5.1 Transformer

5.1.1 Empirical Results

5.1.2 Universal Transformer

5.2 The Annotated Transformer

5.2.1 Thought Vectors

5.3 Just Point to It!

5.3.1 Experimental Results

5.4 Order Matters

5.5 Reversing Input Sentences

5.6 Impact