4 Large language models

This chapter covers

  • Explaining why Transformers replaced earlier architectures
  • Identifying what makes a language model “large”
  • Illustrating how models integrate multiple input modalities
  • Analyzing how scale reshapes learning and costs

Large language models occupy a central place in current debates about whether machines can think, not because they introduced a new theory of intelligence, but because they are the most extensive attempt yet to reproduce intelligent behavior through computation. Their fluency, flexibility, and apparent grasp of meaning emerge from architectural and training choices that made it possible to build systems far larger, and far more exposed to language, than any before.

This scaling did not come from a single breakthrough, but from the convergence of several forces: an architecture that can process language without sequential constraints, a notion of capacity tied to vast numbers of parameters, and a training process that links data and compute into a shared path of expansion.

4.1 The Transformer architecture

4.1.1 Attention was all we needed

4.1.2 Providing a sense of order

4.1.3 Listening to the self

4.1.4 Multiple heads are better than one

4.1.5 Making depth trainable

4.1.6 One block, many behaviors

4.1.7 Connecting representations

4.1.8 From blocks to outputs

4.2 What makes a language model “large”

4.2.1 Where the learning fits

4.2.2 Large models through broad exposure

4.2.3 Scale comes with demands

4.3 From large to usable

4.3.1 Learning before specialization

4.3.2 Adapting from within

4.3.3 Transferring scale

4.3.4 Compressing computation

4.4 Somebody call the expert

4.4.1 Experts needed

4.4.2 Learning to specialize

4.4.3 Global scale, selective growth

4.5 Beyond text: multimodality

4.5.1 Worlds beyond words

4.5.2 A shared space

4.5.3 Architectures for multimodal fusion

4.5.4 Grounding without causation

4.6 The cost of intelligence

4.6.1 Training at scale

4.6.2 The daily cost of intelligence

4.6.3 Powering intelligence

4.6.4 Cost as a research constraint