
11 Large Language Models (LLMs)

 

This chapter covers

  • Understanding the intuition of large language models
  • Identifying and preparing LLM training data
  • Walking through the operations involved in training a large language model
  • Exploring implementation details and LLM tuning approaches

11.1 What are large language models?

Large language models (LLMs) are machine learning models specialized for natural language processing tasks, such as language generation. Consider the autocomplete feature on your mobile device’s keyboard (figure 11.1). When you start typing “Hey, what are…”, the keyboard likely predicts that the next word is “you”, “we”, or “the”, because these are the most common words to follow that phrase. It makes this choice by scanning a table of probabilities built from commonly available pieces of text. This simple table is a language model.

Figure 11.1 Example of autocomplete as a language model
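To make the table-of-probabilities idea concrete, here is a minimal Python sketch. The phrases and probabilities are invented purely for illustration; a real autocomplete model estimates them from large amounts of text rather than storing them in a hand-written dictionary.

import random

# Toy "language model": a lookup table mapping a context phrase to the
# probabilities of possible next words. These entries are invented for
# illustration only.
next_word_probs = {
    "hey, what are": {"you": 0.6, "we": 0.25, "the": 0.15},
}

def predict_next_word(context: str) -> str:
    """Return a next word sampled according to the table's probabilities."""
    probs = next_word_probs[context.lower()]
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next_word("Hey, what are"))  # most often prints "you"

Sampling from the probabilities (rather than always picking the single most likely word) is what lets the same context produce different, but still plausible, suggestions.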

A large language model (LLM) is built on exactly the same idea, with some fundamental upgrades that unlock the interesting capabilities that come from predicting more than one word at a time.
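The step from single-word autocomplete to longer generation is to apply the same prediction repeatedly: predict one word, append it to the context, and predict again. The sketch below shows this loop with another invented two-word lookup table; in an LLM the probabilities come from a trained neural network, not a dictionary.

import random

# Toy two-word lookup table (entries invented for illustration). Text is
# generated autoregressively: predict one word, append it, and repeat.
next_word_probs = {
    "what are": {"you": 0.7, "we": 0.3},
    "are you": {"doing": 0.6, "up": 0.4},
    "you doing": {"today": 0.8, "later": 0.2},
}

def generate(context: str, max_new_words: int = 3) -> str:
    words = context.lower().split()
    for _ in range(max_new_words):
        key = " ".join(words[-2:])       # condition on the last two words
        probs = next_word_probs.get(key)
        if probs is None:                # stop when the context is unknown
            break
        choices, weights = zip(*probs.items())
        words.append(random.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("What are"))  # e.g. "what are you doing today"

The rest of this chapter replaces the lookup table with a neural network that learns these next-word probabilities from data, following the sections outlined below.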

11.2 The intuition behind language prediction

11.2.1 Why the size of tokens and parameters matter

11.2.2 An LLM training workflow

11.3 Preparing training data

11.3.1 Selecting and collecting data

11.3.2 Cleaning and preprocessing data

11.4 Encoding: From text to numbers

11.4.1 Tokenization

11.4.2 Vectorization

11.5 Designing the ANN architecture (and why transformers)

11.6 Encoding: Creating trainable embeddings

11.6.1 Sampling a batch of tokens

11.6.2 Creating a trainable embedding matrix

11.6.3 Creating positional encodings

11.6.4 Combining the embedding matrix and positional encodings

11.7 Self-attention: Start training the LLM

11.7.1 Linear weight matrix projections

11.7.2 Ask every other token

11.7.3 Calculating attention weights

11.7.4 Weighted sum

11.7.5 Multiple attention heads

11.7.6 Layer normalization

11.8 Decoding: Meaning through neural networks

11.8.1 Project up layer

11.8.2 Project down layer

11.8.3 Layer normalization

11.8.4 Stacking transformer blocks

11.8.5 Making a prediction