9 Optimizing and scaling large language models
This chapter covers
- Model pruning and distillation
- Model sharding
- Inference-time optimization
- Extending context windows
The massive size of large language models (LLMs) poses unique challenges for deployment as well as training. Now that we’ve considered quantization and parameter-efficient fine-tuning for training, we will shift our focus to deployment. In production, models must run efficiently on common hardware. Typically, this does not mean exotic supercomputing accelerators, but commodity GPUs such as A100 or H100 cards on cloud platforms or high-end RTX cards on workstations. While powerful, these devices are costly and resource-constrained, which makes efficiency a practical necessity rather than a luxury.
To meet this challenge, we explore techniques that turn research-grade models into deployable systems. These include pruning and distillation to shrink models while retaining most of their accuracy, sharding to distribute very large models across multiple devices, and inference-time optimizations such as FlashAttention and PagedAttention. We also look at advances in extending context windows, using methods like rotary positional embeddings (RoPE), Yet another RoPE extensioN (YaRN), and interleaved RoPE (iRoPE) to push transformers from thousands of tokens to hundreds of thousands or even millions.
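As a preview of the kind of machinery we will build on later in the chapter, the following is a minimal sketch of rotary positional embeddings in PyTorch. The function name `rotary_embed`, the tensor shapes, and the example inputs are illustrative assumptions, not the API of any particular library; only the rotation itself follows the standard RoPE formulation.

```python
# Minimal RoPE sketch (assumption: plain PyTorch, no specific LLM library).
# Each adjacent pair of channels is rotated by an angle proportional to the
# token position, which is how RoPE encodes position into queries and keys.
import torch

def rotary_embed(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Apply RoPE to a tensor of shape (seq_len, dim); dim must be even."""
    seq_len, dim = x.shape
    # Per-pair rotation frequency: theta_i = base^(-2i / dim)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    # Angle for each (position, pair): position * theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]            # split channels into pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Example: queries for an 8-token sequence with 64-dimensional heads
q = torch.randn(8, 64)
print(rotary_embed(q).shape)  # torch.Size([8, 64])
```

Methods like YaRN and iRoPE work by rescaling or selectively applying these rotation frequencies so that the model can handle positions far beyond those seen during training; we return to them in detail later in the chapter.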