2 Training large language models: Learning at scale

 

This chapter covers

  • How LLMs and multimodal models are trained
  • Exploring efficient architectures, such as mixture-of-experts and sparse models
  • Improving performance through post-training and inference-time techniques
  • Emergent properties of LLMs

For decades, the digital economy has run on the currency of data: the business of collecting and trading information about who we are and what we do online is worth trillions of dollars. As more of our daily activities have moved onto the internet, this mill has ever more grist to grind. Large language models (LLMs) are inventions of the internet age, emulating human language by vacuuming up terabytes of text, image, and video data found online.

This scale has demanded new approaches to make models not just larger but also more specialized and adaptable. Researchers have developed techniques to make these models more capable and efficient, including multimodal training, which allows models to process images and text together. However, simply scaling up models isn't always practical. This constraint has driven innovations such as knowledge transfer techniques, which produce smaller, more efficient models, and mixture-of-experts (MoE) architectures, which allow for larger, more specialized models without proportional increases in computational cost. Other strategies, such as test-time scaling and post-training techniques, refine model behavior after pre-training is complete.

2.1 How are LLMs trained?

2.1.1 Exploring open web data collection

2.1.2 Demystifying autoregression and bidirectional token prediction

2.2 Training multimodal LLMs

2.3 Transferring knowledge for efficient models

2.4 Mixture-of-experts and sparse models

2.5 Reasoning models

2.6 Techniques for post-training LLMs

2.6.1 Supervised fine-tuning

2.6.2 Reinforcement learning from human feedback

2.6.3 Direct preference optimization

2.6.4 Reinforcement learning from AI feedback

2.7 Emergent properties of LLMs

2.7.1 Learning with a few examples

2.7.2 Is emergence an illusion?

2.8 Conclusion

2.9 Summary