6 The Birth of Hyperscale (pretraining)
This chapter covers
- Scaling Canon: Kaplan, Chinchilla, and post-Chinchilla scaling laws
- Emergence and Sparks of AGI
- Smooth Loss, Jagged Evals, and Inverse Scaling; Benchmark Contamination and Saturation
- The Hidden Work: Data and Model Parallelism
- Pipeline parallelism with GPipe
Modern artificial intelligence rests on an empirical discovery: bigger models perform better. The gains diminish as size increases, but they diminish predictably, and that predictability gives researchers a roadmap for progress.[1]
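To make "predictably" concrete, the Kaplan et al. scaling laws (covered later in this chapter) express test loss as a power law in model size, L(N) = (N_c / N)^{α_N}. The sketch below is illustrative only: the functional form and the fitted constants (N_c ≈ 8.8 × 10^13 non-embedding parameters, α_N ≈ 0.076) are the paper's, but such fits depend on the data and tokenizer, and the function name `predicted_loss` is ours, so treat the printed numbers as ballpark values rather than predictions.

```python
# Illustrative power-law scaling curve in the style of Kaplan et al. (2020).
# L(N) = (N_c / N) ** alpha_N: loss drops smoothly as parameter count N grows,
# with diminishing returns. The constants are the paper's reported fits, but
# they vary with data and tokenization, so treat the outputs as ballpark only.

def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Predicted cross-entropy loss (nats/token) for n_params non-embedding parameters."""
    return (n_c / n_params) ** alpha_n

# Roughly GPT-2-sized and GPT-3-sized models, for illustration.
for n in (1.5e9, 13e9, 175e9):
    print(f"N = {n:.1e} parameters -> predicted loss ~= {predicted_loss(n):.2f}")
```

The curve flattens noticeably between these sizes, yet it never stops improving: exactly the "diminishing but consistent" behavior the rest of this chapter examines through the Kaplan and Chinchilla results.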
The clearest demonstration came in the leap from GPT-2 (2019) to GPT-3 (2020). GPT-2 was modest by today's standards: 1.5 billion parameters, trained on a relatively narrow dataset (WebText, scraped from links shared on Reddit). Still, it captured the public imagination with examples such as "Ovid's Unicorn," a fabricated news story about a herd of English-speaking unicorns discovered in the Andes Mountains. The piece was so convincing in tone and style that it reset expectations about machine-generated text. It was also cherry-picked, and it contained self-contradictory details such as "four-horned unicorns"; another sample confidently described "fires happening underwater," suggesting a fragile underlying "world model."[2] GPT-2 could dazzle, but it could not sustain that level of performance.