
9 Advanced Quantization Techniques

 

This chapter covers

  • The FlexGen technique to offload parts of an LLM to CPU memory and/or disk.
  • An advanced post-training quantization technique, called SmoothQuant, that reduces memory footprint and accelerates inference.
  • BitNet, a scalable 1-bit Transformer architecture that reduces memory footprint and energy consumption.

This chapter introduces advanced quantization techniques for LLMs, for those cases where the traditional techniques covered in chapter 6 alone cannot overcome specific performance and/or computational challenges. Make sure you have fully understood chapter 6 before reading this chapter.

9.1 What if a domain-specific model isn’t small?

9.2 FlexGen

9.3 SmoothQuant

9.4 BitNet

9.4.1 BitNet and Python

9.5 Summary