5 Quantizing for Your Production Environment

 

This chapter covers

  • The precision formats used to train and run LLMs.
  • Quantizing LLMs to different low-precision formats.
  • How to quantize LLMs using different techniques and libraries.

Chapter 7 introduced you to the main ONNX framework concepts and capabilities. Among these, the possibility of performing model quantization was only touched on briefly: because of the important role it plays in boosting LLM inference performance, it deserves a dedicated chapter. That is the core topic here.

5.1 Transformers precision formats

Some numeric precision formats have already been mentioned in previous chapters of this book in the context of LLM training and inference. Let's now look at them in more depth and extend the discussion to other precision formats specific to model quantization. In conventional scientific computing, 64-bit floating point (also known as double precision) has typically been the standard, thanks to its ability to represent a wide range of numbers with high accuracy. When training deep neural networks on GPUs, however, a lower precision, 32-bit floating point, is used: 64-bit floating-point operations are considered unnecessary and computationally expensive, and GPU hardware is not optimized for 64-bit precision. 32-bit floating-point operations (also known as single precision) have therefore become the standard for DL training.
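As a quick illustration of this trade-off, the following minimal PyTorch sketch (assuming PyTorch is installed; any numerical framework would work similarly) prints the per-element memory footprint and the rounding error introduced when the same value is stored in double, single, half, and bfloat16 precision. The specific value and formats chosen here are just illustrative.

import torch

# Store an irrational-looking value in double precision as a reference.
x = torch.tensor([1 / 3], dtype=torch.float64)

# Cast it to progressively narrower floating-point formats and compare
# the memory cost per element and the stored value.
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
    y = x.to(dtype)
    print(f"{str(dtype):16} bytes/element={y.element_size()}  value={y.item():.12f}")

Running this shows that each halving of the storage size (8, 4, 2, and 2 bytes per element) comes with fewer significant digits preserved, which is exactly the accuracy-versus-cost balance that quantization pushes even further with 8-bit and 4-bit integer formats later in this chapter.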

5.2 8-bit quantization

5.2.1 Hands-on 8-bit quantization

5.2.2 LLM.int8() and quantization

5.3 8-bit quantization with ONNX

5.4 4-bit quantization

5.4.1 4-bit quantization with GPTQ

5.4.2 4-bit quantization with ggml

5.5 Summary