2 Working with text data

This chapter covers

Preparing text for large language model training
Splitting text into word and subword tokens
Byte pair encoding as a more advanced way of tokenizing text
Sampling training examples with a sliding window approach
Converting tokens into vectors that feed into a large language model

So far, we’ve covered the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs.

During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters using a next-word prediction task yields models with impressive capabilities. These models can then be further finetuned to follow general instructions or perform specific target tasks. But before we can implement and train LLMs, we need to prepare the training dataset, as illustrated in figure 2.1.

Figure 2.1 The three main stages of coding an LLM. This chapter focuses on step 1 of stage 1: implementing the data sampling pipeline.
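
To make the next-word prediction task mentioned above concrete, here is a minimal sketch of how a short piece of text could be turned into (context, target) training pairs. It assumes that splitting on whitespace is an acceptable stand-in for a real tokenizer, and the function name next_word_pairs is hypothetical, chosen only for this illustration.

```python
def next_word_pairs(text):
    """Return (context, target) pairs for next-word prediction.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    words = text.split()
    # Each example pairs the words seen so far with the word that follows them.
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("LLMs learn to predict one word at a time")

for context, target in pairs:
    print(" ".join(context), "---->", target)
```

Each printed line shows a growing context on the left and the word the model would be asked to predict on the right. A real pipeline would work on token IDs instead of raw words and would limit the context to a fixed length.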

2.1 Understanding word embeddings

2.2 Tokenizing text

2.3 Converting tokens into token IDs

2.4 Adding special context tokens

2.5 Byte pair encoding

2.6 Data sampling with a sliding window

2.7 Creating token embeddings

2.8 Encoding word positions

Summary