9 Optimizing and scaling large language models
The massive size of large language models brings unique challenges for deployment as well as training. Now that we've considered quantization and parameter-efficient fine-tuning (PEFT) for training, we will shift our focus to deployment. In production, models must run efficiently on common hardware. This is typically not exotic supercomputing accelerators but commodity GPUs such as A100 or H100 cards on cloud platforms, or high-end RTX cards on workstations. While powerful, these devices are costly and resource-constrained, which makes efficiency a practical necessity rather than a luxury.
To meet this challenge, we explore techniques that turn research-grade models into deployable systems. These include pruning and distillation to shrink models while retaining most of their accuracy, sharding to distribute very large models across multiple devices, and inference-time optimizations such as FlashAttention and paged attention. We also look at advances in extending context windows, using methods like RoPE, YaRN, and iRoPE to push transformers from thousands of tokens to hundreds of thousands or even millions.
Together, these strategies define the toolkit for optimizing and scaling LLMs. They bridge the gap between theoretical performance and practical utility, ensuring that models are not only powerful but also usable in real-world environments.