chapter twelve
12 From recurrence to attention: Google Brain and the transformer architecture
This chapter covers
- Ashish Vaswani et al.’s Attention Is All You Need (2017) and the break from sequential state propagation toward relational modeling
- Why modeling a sequence as compressed memory constrains learning, and how attention preserves structure across distance
- How next-token prediction scales from local competition to coherent, multi-paragraph generation
- How self-attention—queries, keys, values, and masking—constructs context through learned representation
- Why relational representation stands as a synthesis of probability, information, and generalization in modern AI
In 2015, the so-called godfathers of deep learning—Yann LeCun, Yoshua Bengio, and Geoffrey Hinton—argued that neural networks had effectively “arrived.” These layered models were no longer academic curiosities. They powered speech recognition, image classification, and early language systems at scale, learning internal representations directly from raw data and replacing manual feature design with end-to-end optimization.