4 Large language models
This chapter covers
- Explaining why Transformers replaced earlier architectures
- Identifying what makes a language model “large”
- Illustrating how models integrate multiple input modalities
- Analyzing how scale reshapes learning and costs
Large language models occupy a central place in current debates about whether machines can think, not because they introduced a new theory of intelligence, but because they represent the most extensive attempt yet to reproduce intelligent behavior through computation. Their fluency, flexibility, and apparent grasp of meaning emerge from architectural and training choices that made it possible to build systems far larger, and trained on far more language, than any before.
This scaling did not come from a single breakthrough, but from the convergence of several forces: an architecture that can process language without sequential constraints, a notion of capacity tied to vast numbers of parameters, and a training process in which data and compute grow together along a shared trajectory.
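To make the notion of "capacity tied to vast numbers of parameters" concrete, the sketch below estimates the parameter count of a decoder-only Transformer from a few configuration values. The formula is a deliberate simplification (it ignores biases, layer norms, and positional parameters), and the example configuration is merely illustrative, not a description of any specific production model.

```python
def transformer_params(d_model: int, n_layers: int, vocab_size: int,
                       d_ff: int = 0) -> int:
    """Rough parameter count for a decoder-only Transformer.

    Ignores biases, layer norms, and positional embeddings; the
    feed-forward width defaults to the common 4 * d_model choice.
    """
    d_ff = d_ff or 4 * d_model
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # two feed-forward projections
    embed = vocab_size * d_model   # token embedding table
    return n_layers * (attn + ffn) + embed

# A small configuration (12 layers, 768-dimensional hidden state,
# ~50k-token vocabulary) already lands above a hundred million
# parameters; the largest models multiply every term here.
print(transformer_params(d_model=768, n_layers=12, vocab_size=50257))
```

Because the attention and feed-forward terms scale quadratically with `d_model` and linearly with depth, widening and deepening a model compounds quickly, which is why parameter counts moved from millions to hundreds of billions in only a few years.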