2 Working with Text Data

This chapter covers

Preparing text for large language model training
Splitting text into word and subword tokens
Byte pair encoding as a more advanced way of tokenizing text
Sampling training examples with a sliding window approach
Converting tokens into vectors that feed into a large language model

In the previous chapter, we delved into the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies ChatGPT and other popular GPT-like LLMs.

During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters using a next-word prediction task yields models with impressive capabilities. These models can then be further finetuned to follow general instructions or perform specific target tasks. But before we can implement and train LLMs in the upcoming chapters, we need to prepare the training dataset, which is the focus of this chapter, as illustrated in Figure 2.1.

Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter will explain and code the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.
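
To make the next-word prediction task concrete, the following minimal sketch shows how a piece of raw text can be turned into (context, target) training pairs. It is an illustration under simplifying assumptions, not the pipeline developed in this chapter: the function name next_word_pairs and the context_size parameter are hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
def next_word_pairs(text, context_size=4):
    """Build (context, target) pairs for a next-word prediction task.

    Assumption: words are obtained with a naive whitespace split,
    which is only a stand-in for a proper tokenizer.
    """
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        # The context is made of up to `context_size` words preceding position i.
        context = words[max(0, i - context_size):i]
        # The target is the word the model should predict next.
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs[:3]:
    print(context, "->", target)
```

Each pair asks the model to predict a single word from the words that precede it, so one sentence already yields several training examples without any manual labeling.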

2.1 Understanding word embeddings

2.2 Tokenizing text

2.3 Converting tokens into token IDs

2.4 Adding special context tokens

2.5 Byte pair encoding

2.6 Data sampling with a sliding window

2.7 Creating token embeddings

2.8 Encoding word positions

2.9 Summary