chapter twelve

12 From recurrence to attention: Google Brain and the transformer architecture

 

This chapter covers

  • Ashish Vaswani et al.’s Attention Is All You Need (2017) and the break from sequential state propagation toward relational modeling
  • Why modeling sequence as compressed memory constrains learning—and how attention preserves structure across distance
  • How next-token prediction scales from local competition to coherent, multi-paragraph generation
  • How self-attention—queries, keys, values, and masking—constructs context through learned representation
  • Why relational representation stands as a synthesis of probability, information, and generalization in modern AI

In 2015, the so-called godfathers of deep learning—Yann LeCun, Yoshua Bengio, and Geoffrey Hinton—argued that neural networks had effectively “arrived.” These layered models were no longer academic curiosities. They powered speech recognition, image classification, and early language systems at scale, learning internal representations directly from raw data and replacing manual feature design with end-to-end optimization.

12.1 From sequential processing to relational representation

12.1.1 Sequence as a modeling assumption

12.1.2 The compression bottleneck

12.1.3 From sequential compression to relational attention

12.1.4 From representation to capability

12.2 Worked examples of attention in action

12.2.1 Example 1: resolving a pronoun through attention

12.2.2 Example 2: factual recall and semantic alignment

12.2.3 Example 3: multi-step autoregressive reasoning

12.3 Intellectual positioning

12.3.1 Foundations leveraged

12.3.2 Foundations reconsidered

12.3.3 What is truly new

12.4 Why attention endures

12.4.1 Scalability

12.4.2 Modularity

12.4.3 Mathematical simplicity

12.4.4 Cross-domain adaptability

12.4.5 Empirical scaling and ecosystem reinforcement

12.4.6 Constraints, evolution, and limits

12.4.7 Closing thoughts

12.5 Conclusion and transition

12.6 Summary