8 Advanced Quantization Techniques

This chapter covers

  • The FlexGen technique, which offloads parts of an LLM to CPU memory and/or disk.
  • SmoothQuant, an advanced post-training quantization technique that reduces memory footprint and accelerates inference.
  • BitNet, a scalable 1-bit Transformer architecture that reduces memory footprint and energy consumption.

This chapter introduces advanced techniques for LLM quantization, for those cases where the traditional techniques covered in chapter 5 cannot, on their own, solve specific performance and/or computational challenges. Please make sure you have fully understood chapter 5 before reading this chapter.

8.1 What if a domain-specific model isn’t small?

8.2 FlexGen

8.3 SmoothQuant

8.4 BitNet

8.4.1 BitNet and Python

8.5 Summary