
Appendix D. Adding Bells and Whistles to the Training Loop


In this appendix, we enhance the training function for the pretraining and finetuning processes covered in chapters 5-7. Specifically, the first three sections cover learning rate warmup, cosine decay, and gradient clipping.

The final section then incorporates these techniques into the training function developed in chapter 5 and pretrains an LLM.

To make the code in this appendix self-contained, we reinitialize the model we trained in chapter 5.

import torch
from previous_chapters import GPTModel
 
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "ctx_len": 256,       # Shortened context length (orig: 1024)
    "emb_dim": 768,       # Embedding dimension
    "n_heads": 12,        # Number of attention heads
    "n_layers": 12,       # Number of layers
    "drop_rate": 0.1,     # Dropout rate
    "qkv_bias": False     # Query-key-value bias
}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

After initializing the model, we also need to initialize the data loaders we used in chapter 5. First, we load the "The Verdict" short story:
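The listing below is a minimal sketch of that setup. It assumes the the-verdict.txt file is already present in the working directory and that the create_dataloader_v1 function from chapter 2 can be imported via previous_chapters; the 90/10 train/validation split and the batch size of 2 are illustrative values, not requirements.

import os
from previous_chapters import create_dataloader_v1

# Assumes the-verdict.txt was downloaded in chapter 2 and is available locally
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

# Split the text into a training and a validation portion
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=2,
    max_length=GPT_CONFIG_124M["ctx_len"],
    stride=GPT_CONFIG_124M["ctx_len"],
    drop_last=True,
    shuffle=True
)

val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=2,
    max_length=GPT_CONFIG_124M["ctx_len"],
    stride=GPT_CONFIG_124M["ctx_len"],
    drop_last=False,
    shuffle=False
)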

D.1 Learning rate warmup
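Learning rate warmup gradually increases the learning rate from a small initial value to a peak value over the first training steps, which helps stabilize the early phase of training. The snippet below is a sketch of linear warmup implemented by setting the optimizer's learning rate manually on each step; the variable names and the specific values (initial_lr, peak_lr, the 20% warmup fraction) are illustrative assumptions.

n_epochs = 15
initial_lr = 0.0001   # small starting learning rate (illustrative)
peak_lr = 0.01        # target learning rate after warmup (illustrative)

total_steps = len(train_loader) * n_epochs
warmup_steps = int(0.2 * total_steps)   # warm up over the first 20% of steps

optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
lr_increment = (peak_lr - initial_lr) / warmup_steps

global_step = -1
track_lrs = []

for epoch in range(n_epochs):
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        # Increase the learning rate linearly during the warmup phase,
        # then hold it at the peak value
        if global_step < warmup_steps:
            lr = initial_lr + global_step * lr_increment
        else:
            lr = peak_lr

        # Apply the current learning rate to all parameter groups
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        track_lrs.append(lr)

        # (Loss calculation and optimizer.step() omitted; this loop
        # only records the learning rate schedule)

Recording the learning rate in track_lrs makes it easy to plot the schedule and verify the warmup behaves as intended.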

D.2 Cosine decay
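After the warmup phase, cosine decay gradually lowers the learning rate along a half-cosine curve from the peak value down to a minimum value, which typically improves training stability toward the end of training. The following sketch combines the warmup from the previous section with cosine decay; it reuses the illustrative variables defined above, and min_lr is likewise an assumed value.

import math

min_lr = 0.1 * initial_lr   # learning rate floor (illustrative)
lr_increment = (peak_lr - initial_lr) / warmup_steps

global_step = -1
track_lrs = []

for epoch in range(n_epochs):
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        if global_step < warmup_steps:
            # Linear warmup phase
            lr = initial_lr + global_step * lr_increment
        else:
            # Cosine decay phase: progress runs from 0 to 1 over the
            # remaining steps; the cosine term anneals the learning rate
            # from peak_lr down to min_lr
            progress = ((global_step - warmup_steps)
                        / (total_steps - warmup_steps))
            lr = min_lr + (peak_lr - min_lr) * 0.5 * (
                1 + math.cos(math.pi * progress))

        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        track_lrs.append(lr)

        # (Loss calculation and optimizer.step() omitted; this loop
        # only records the learning rate schedule)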

D.3 Gradient clipping
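Gradient clipping rescales the gradients whenever their overall norm exceeds a threshold, which prevents occasional large gradients from destabilizing training. PyTorch provides this via torch.nn.utils.clip_grad_norm_. The sketch below computes the gradients for a single batch, inspects the largest gradient value before and after clipping, and assumes the calc_loss_batch helper from chapter 5 is importable via previous_chapters; the find_highest_gradient helper is added here purely for illustration.

from previous_chapters import calc_loss_batch   # assumed loss helper from chapter 5

def find_highest_gradient(model):
    # Return the largest absolute gradient value across all parameters,
    # which makes the effect of clipping easy to see
    max_grad = None
    for param in model.parameters():
        if param.grad is not None:
            grad_values = param.grad.data.flatten()
            max_grad_param = grad_values.abs().max()
            if max_grad is None or max_grad_param > max_grad:
                max_grad = max_grad_param
    return max_grad

# Compute gradients for one batch
torch.manual_seed(123)
model.to(device)
input_batch, target_batch = next(iter(train_loader))
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward()

print("Largest gradient before clipping:", find_highest_gradient(model))

# Rescale the gradients so their global L2 norm does not exceed 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

print("Largest gradient after clipping:", find_highest_gradient(model))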

D.4 The modified training function
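The sketch below combines the three techniques into a single training function modeled on the chapter 5 training loop. It assumes the calc_loss_batch and calc_loss_loader helpers from chapter 5 are importable via previous_chapters; the parameter names, default values, and evaluation settings are illustrative, not a definitive implementation.

import math
from previous_chapters import calc_loss_batch, calc_loss_loader   # assumed chapter 5 helpers

def train_model(model, train_loader, val_loader, optimizer, device,
                n_epochs, eval_freq, eval_iter,
                warmup_steps, initial_lr=3e-5, min_lr=1e-6):
    train_losses, val_losses, track_lrs = [], [], []
    global_step = -1

    # The peak learning rate is taken from the optimizer's setting
    peak_lr = optimizer.param_groups[0]["lr"]
    total_training_steps = len(train_loader) * n_epochs
    lr_increment = (peak_lr - initial_lr) / warmup_steps

    for epoch in range(n_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            global_step += 1

            # 1) Learning rate warmup followed by cosine decay
            if global_step < warmup_steps:
                lr = initial_lr + global_step * lr_increment
            else:
                progress = ((global_step - warmup_steps)
                            / (total_training_steps - warmup_steps))
                lr = min_lr + (peak_lr - min_lr) * 0.5 * (
                    1 + math.cos(math.pi * progress))
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr
            track_lrs.append(lr)

            # 2) Standard forward and backward pass
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()

            # 3) Gradient clipping after the warmup phase
            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            # Periodic evaluation on the training and validation sets
            if global_step % eval_freq == 0:
                model.eval()
                with torch.no_grad():
                    train_loss = calc_loss_loader(
                        train_loader, model, device, num_batches=eval_iter)
                    val_loss = calc_loss_loader(
                        val_loader, model, device, num_batches=eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
                model.train()

    return train_losses, val_losses, track_lrs

A possible way to call this function, again with illustrative hyperparameter values:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)

peak_lr = 5e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

n_epochs = 15
total_steps = len(train_loader) * n_epochs
warmup_steps = int(0.2 * total_steps)

train_losses, val_losses, lrs = train_model(
    model, train_loader, val_loader, optimizer, device,
    n_epochs=n_epochs, eval_freq=5, eval_iter=1,
    warmup_steps=warmup_steps, initial_lr=1e-5, min_lr=1e-5
)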