This chapter covers
- Explaining how LLMs are trained
- Introducing the emergent properties of LLMs
- Exploring the harms and vulnerabilities that come from training LLMs
For decades, the digital economy has run on the currency of data. The business of collecting and trading information about who we are and what we do online is worth trillions of dollars, and as more of our daily activities have moved onto the internet, the mill has ever more grist to grind. Large language models (LLMs) are inventions of the internet age, emulating human language by vacuuming up terabytes of text found online.
The process has yielded both predictable and unpredictable results. Notably, there are significant questions about what is in the datasets used to train LLMs and about how to keep the models from reproducing the more objectionable text those datasets contain. With data collection at this scale, sweeping in personal information and low-quality, spammy, or offensive content is inevitable; working out how to address the problem is another challenge entirely.

At the same time, LLMs at the scale we're now seeing have exhibited a host of capabilities that don't seem to be available to smaller language models. These emergent properties make LLMs attractive for a wide variety of uses and ensure that the race toward more data and bigger models won't end anytime soon.