9 Advanced quantization techniques
This chapter covers
- Using the FlexGen technique to offload part of an LLM to CPU memory or disk
- Using SmoothQuant, an advanced post-training quantization technique, to reduce memory footprint and accelerate inference
- Using BitNet, a scalable 1-bit Transformer architecture, to reduce memory footprint and energy consumption
Typically, domain-specific language models are small: in my professional experience, they usually have no more than 7 or 8 billion parameters (excluding the adapter parameters added by LoRA or QLoRA fine-tuning, which are comparatively few). The techniques explained in previous chapters make such models practical to deploy and use in computationally constrained environments. Still, a larger pool of specialized training data, unstructured training data with representations larger than natural language, or a larger baseline model that better fits the task can produce a final model that exceeds the capacity of the available inference hardware, even after quantization.
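To make the size problem concrete, here is a rough back-of-the-envelope sketch. It estimates the memory needed for the weights alone as parameter count times bits per weight, and it ignores activations, the KV cache, and quantization metadata such as scales, so real footprints are somewhat higher. The helper name `weight_memory_gib`, the 70-billion-parameter model, and the 24 GiB device budget are illustrative assumptions, not figures from this chapter.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical device budget, chosen only for illustration.
DEVICE_BUDGET_GIB = 24

for params in (7e9, 70e9):
    for bits in (16, 8, 4):
        size = weight_memory_gib(params, bits)
        verdict = "fits" if size <= DEVICE_BUDGET_GIB else "does not fit"
        print(f"{params / 1e9:.0f}B parameters at {bits}-bit: "
              f"{size:.1f} GiB ({verdict} in {DEVICE_BUDGET_GIB} GiB)")
```

Under these assumptions, a 7-billion-parameter model fits comfortably even at 16-bit precision, while a 70-billion-parameter model still exceeds the budget at 4 bits per weight, which is the situation the techniques in this chapter target.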
This chapter introduces advanced LLM quantization techniques that can help you run such specialized models efficiently in target environments, balancing quality and speed. The sections that follow describe these strategies and compare their pros and cons. You’ll also find detailed code examples and benchmarks.