chapter six
6 The Birth of Hyperscale
This chapter covers
- Scaling Canon: Kaplan, Chinchilla, and post-Chinchilla scaling laws
- Sparks of AGI and emergent behavior
- Smooth Loss, Jagged Evals, and Inverse Scaling
- The Hidden Work: Data and Model Parallelism
- GPipe and Pipeline Parallelism
A recurring lesson of modern artificial intelligence is that bigger models tend to perform better. The clearest demonstration came with the leap in performance from GPT-2 (2019) to GPT-3 (2020). GPT-2 was modest by today’s standards: it had 1.5 billion parameters and was trained on a relatively narrow dataset (WebText, a collection of web pages gathered from links shared on Reddit). Still, it captured the public imagination with examples such as Ovid’s Unicorn (chapter 1). The piece was so convincing in tone and style that it reset expectations about machine-generated text, to the point that almost everyone overlooked that it was cherry-picked and included oxymoronic references such as “four-horned unicorns.”