9 Optimizing and scaling large language models
This chapter covers
- Model pruning and distillation
- Model sharding
- Inference-time optimization
- Extending context windows
The massive size of large language models (LLMs) poses unique challenges for deployment as well as training. Now that we’ve considered quantization and parameter-efficient fine-tuning for training, we will shift our focus to deployment. In production, models must run efficiently on common hardware. Typically, this does not mean exotic supercomputing accelerators, but commodity GPUs such as A100 or H100 cards on cloud platforms or high-end RTX cards on workstations. While powerful, these devices are costly and resource-constrained, which makes efficiency a practical necessity rather than a luxury.
To meet this challenge, we explore techniques that turn research-grade models into deployable systems. These include pruning and distillation to shrink models while retaining most of their accuracy, sharding to distribute very large models across multiple devices, and inference-time optimizations such as FlashAttention and PagedAttention. We also look at advances in extending context windows, using methods like rotary positional embeddings (RoPE), Yet another RoPE extensioN (YaRN), and interleaved RoPE (iRoPE) to push transformers from thousands of tokens to hundreds of thousands or even millions.
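As a preview of the kind of machinery we will build on later in the chapter, the following is a minimal sketch of rotary positional embeddings in PyTorch. The function name `rotary_embed`, the tensor shapes, and the example inputs are illustrative assumptions, not the API of any particular library; only the rotation itself follows the standard RoPE formulation.

```python
# Minimal RoPE sketch (assumption: plain PyTorch, no specific LLM library).
# Each adjacent pair of channels is rotated by an angle proportional to the
# token position, which is how RoPE encodes position into queries and keys.
import torch

def rotary_embed(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Apply RoPE to a tensor of shape (seq_len, dim); dim must be even."""
    seq_len, dim = x.shape
    # Per-pair rotation frequency: theta_i = base^(-2i / dim)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    # Angle for each (position, pair): position * theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]            # split channels into pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Example: queries for an 8-token sequence with 64-dimensional heads
q = torch.randn(8, 64)
print(rotary_embed(q).shape)  # torch.Size([8, 64])
```

Methods like YaRN and iRoPE work by rescaling or selectively applying these rotation frequencies so that the model can handle positions far beyond those seen during training; we return to them in detail later in the chapter.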