appendix D Adding bells and whistles to the training loop
In this appendix, we enhance the training function for the pretraining and fine-tuning processes covered in chapters 5 to 7. In particular, we cover learning rate warmup, cosine decay, and gradient clipping, and then incorporate these techniques into the training function to pretrain an LLM.
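Before diving into the details, the following minimal sketch previews how the three techniques interact in a single PyTorch training loop. The toy model, data, and hyperparameter values here are illustrative assumptions only; the rest of the appendix develops each piece for the GPT model step by step.

import math
import torch

# Stand-in model and data purely for illustration; the appendix applies the
# same ideas to the GPT model and real data loaders later.
toy_model = torch.nn.Linear(10, 1)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 1)
optimizer = torch.optim.AdamW(toy_model.parameters(), weight_decay=0.1)

peak_lr = 0.001       # Maximum learning rate, reached at the end of warmup
initial_lr = 0.0001   # Learning rate at the very first step
min_lr = 0.00001      # Floor that the cosine decay approaches
warmup_steps = 20
total_steps = 100

def get_lr(step):
    if step < warmup_steps:
        # Linear warmup: ramp from initial_lr up to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: anneal from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

for step in range(total_steps):
    for param_group in optimizer.param_groups:
        param_group["lr"] = get_lr(step)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(toy_model(inputs), targets)
    loss.backward()
    # Gradient clipping: rescale gradients so their global L2 norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    optimizer.step()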
To make the code self-contained, we reinitialize the model we trained in chapter 5:
import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Shortened context length (original: 1,024)
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
model.eval()
After initializing the model, we need to initialize the data loaders. First, we load the “The Verdict” short story:
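As a minimal sketch, assuming the story was already downloaded and saved locally as the-verdict.txt in chapter 2, the text can be read in as follows (the file name is taken from that chapter):

# A minimal sketch: read the locally saved short story into a single string.
# Assumes "the-verdict.txt" was downloaded in chapter 2.
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    text_data = file.read()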