
6 The Birth of Hyperscale (pretraining)

 

This chapter covers

  • Scaling Canon: Kaplan, Chinchilla, and post-Chinchilla scaling laws
  • Emergence and Sparks of AGI
  • Smooth Loss, Jagged Evals, and Inverse Scaling, plus Benchmark Contamination and Saturation
  • The Hidden Work: Data and Model Parallelism
  • GPipe

Modern AI research has repeatedly found that bigger models perform better. The gains diminish as size increases, but the relationship is consistent enough to give researchers a roadmap for progress.[1]
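Concretely, that "consistent relationship" is a power law in model size. The snippet below is a rough, illustrative sketch, not a derivation: it plugs a few familiar parameter counts into the loss-versus-parameters fit reported by Kaplan et al. (2020), L(N) = (N_c / N)^alpha_N, using the published constants purely for illustration. Section 6.1.1 treats the fit properly.

# Minimal sketch of a Kaplan-style power law, L(N) = (N_c / N)**ALPHA_N.
# Constants below are the fit reported by Kaplan et al. (2020), used only
# to illustrate the shape of the curve.
N_C = 8.8e13      # reference (non-embedding) parameter count from the published fit
ALPHA_N = 0.076   # exponent from the published fit

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats per token) for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Doubling the model multiplies loss by a constant factor of 2**-ALPHA_N (about 0.95):
# steady improvement, but with diminishing absolute gains.
for n in (1.5e9, 13e9, 175e9):   # roughly GPT-2, a mid-sized GPT-3 variant, GPT-3 175B
    print(f"{n:.1e} params -> predicted loss {loss_from_params(n):.2f}")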

The most explicit demonstration came in the leap from GPT-2 (2019) to GPT-3 (2020). GPT-2 was modest by today’s standards: 1.5 billion parameters, trained on WebText, a relatively narrow dataset of web pages linked from Reddit. Still, it captured the public imagination with examples such as “Ovid’s Unicorn,” a fabricated news story about a herd of English-speaking unicorns discovered in the Andes Mountains. The piece was so convincing in tone and style that it reset expectations about machine-generated text. However, it was cherry-picked and included oxymoronic references such as “four-horned unicorns,” and another sample confidently described “fires happening underwater,” suggesting a fragile underlying “world model.”[2] GPT-2 could dazzle, but it could not sustain that level of performance.

6.1 Scaling Canon

6.1.1 Kaplan Scaling Laws

6.1.2 Does Shape Matter?

6.1.3 Scaling with Data and Model Size

6.1.4 Scaling Compute

6.1.5 Architecture Comparisons

6.1.6 The Scaling Hypothesis

6.1.7 Chinchilla and Post-Chinchilla

6.2 Scaling Detours

6.2.1 Jaggedness

6.2.2 Contamination

6.3 Parallelism

6.3.1 GPipe

6.3.2 Results

6.3.3 Wise to Pipeline?

6.4 Rounding Errors