appendix D Adding bells and whistles to the training loop
In this appendix, we enhance the training function for the pretraining and fine-tuning processes covered in chapters 5 to 7. In particular, we cover learning rate warmup, cosine decay, and gradient clipping, and then incorporate these techniques into the training function and pretrain an LLM.
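As a preview, the following minimal sketch illustrates the first two of these techniques before we build them into the full training function. The hyperparameter values (peak_lr, min_lr, warmup_steps, total_steps) are illustrative placeholders, not the settings we use later in this appendix:

import math

peak_lr, min_lr = 5e-4, 1e-5          # illustrative values only
warmup_steps, total_steps = 20, 200   # illustrative values only

def lr_at_step(step):
    if step < warmup_steps:           # linear learning rate warmup
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

Gradient clipping, the third technique, is applied inside the training loop after the backward pass and before the optimizer step, for example via torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).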
To make the code self-contained, we reinitialize the model we trained in chapter 5:
import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Shortened context length (original: 1024)
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
model.eval()
After initializing the model, we need to initialize the data loaders. First, we load the “The Verdict” short story:
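A minimal sketch of this loading step, assuming the short story has already been saved locally as the-verdict.txt (as in chapter 2), looks like this:

with open("the-verdict.txt", "r", encoding="utf-8") as file:
    text_data = file.read()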