This chapter covers
- Explaining how LLMs are trained
- Introducing the emergent properties of LLMs
- Exploring the harms and vulnerabilities that come from training LLMs
For decades, the digital economy has run on the currency of data. The business of collecting and trading information about who we are and what we do online is worth trillions of dollars, and as more of our daily activities have moved onto the internet, the mill has ever more grist to grind. Large language models (LLMs) are inventions of the internet age, emulating human language by vacuuming up terabytes of text found online.
The process has yielded both predictable and unpredictable results. Notably, there are significant questions about what is in the datasets used to train LLMs and about how to keep the models from reproducing the more objectionable text those datasets contain. With data collection at this scale, sweeping in personal information and low-quality, spammy, or offensive content is inevitable; working out how to address the problem is another challenge entirely.

At the same time, LLMs at the scale we're now seeing have exhibited a host of capabilities that don't seem to be available to smaller language models. These emergent properties make LLMs attractive for a wide variety of uses and ensure that the race toward more data and bigger models won't end anytime soon.