8 Advanced Quantization Techniques
This chapter covers
- The FlexGen technique to offload parts of an LLM to CPU memory and/or disk.
- An advanced post-training quantization technique, called SmoothQuant, to reduce memory footprint and accelerate inference.
- BitNet, a scalable 1-bit Transformer architecture to reduce memory footprint and energy consumption.
This chapter introduces advanced quantization techniques for LLMs, for those cases where the traditional techniques covered in chapter 5 alone cannot solve specific performance and/or computational challenges. Please make sure you have fully understood chapter 5 before proceeding with this chapter.