6 Quantizing for your production environment
This chapter covers
- Precision formats for training and running LLMs
- Quantizing LLMs to smaller-precision formats
- Quantizing LLMs with multiple techniques and libraries
In chapter 5, you learned the core ONNX concepts and capabilities. I only briefly mentioned model quantization, but it’s crucial for LLM inference performance, so it deserves its own chapter. That’s our focus here.
6.1 Transformers precision formats
Previous chapters have mentioned several numeric precision formats used for LLM training and inference. It’s time to look at them more closely and to introduce additional formats used for model quantization.
In traditional scientific computing, 64-bit floating point (double precision) was the default because it represents a wide range of values accurately. But deep neural networks on GPUs typically use 32-bit floating point (single precision) because they rarely need 64-bit precision, and 64-bit operations are slower and often not well supported by GPU hardware. As a result, FP32 became the standard for deep-learning training.
In floating-point numbers, bits are the binary digits used to store a value in a computer’s memory. A floating-point value has three parts: the sign, the exponent, and the mantissa. More exponent bits give a wider representable range, and more mantissa bits give higher precision. In 32-bit floating point, 1 bit is for the sign, 8 bits are for the exponent, and 23 bits are for the mantissa (see figure 6.1).
Figure 6.1 Representation of 32-bit floating-point numbers
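To make the 1/8/23 split concrete, here is a minimal Python sketch that pulls the three fields out of a value using only the standard library. The helper names `fp32_fields` and `fp32_from_fields` are hypothetical, chosen for illustration. The reconstruction step also assumes two details of the IEEE 754 single-precision format that this section has not covered: the exponent is stored with a bias of 127, and normal numbers carry an implicit leading 1 in front of the mantissa bits.

```python
import struct


def fp32_fields(value: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, mantissa) bit fields of value stored as FP32."""
    # Pack the value as a 32-bit float, then reinterpret the same 4 bytes
    # as an unsigned 32-bit integer so we can use bit operations on it.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # top 1 bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa


def fp32_from_fields(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild a normal FP32 value from its fields.

    Assumes the IEEE 754 bias of 127 and an implicit leading 1; zeros,
    subnormals, infinities, and NaNs are deliberately not handled here.
    """
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)


print(fp32_fields(1.0))        # (0, 127, 0)
print(fp32_fields(-0.15625))   # (1, 124, 2097152)
```

The value 1.0 has a stored exponent of 127 because the bias cancels out to 2^0, and -0.15625 is -1.25 × 2^-3, so its stored exponent is 124 and its mantissa encodes the fractional 0.25. A value such as 0.1 cannot be represented exactly in 23 mantissa bits, which is why rebuilding it from its fields gives a number close to, but not equal to, 0.1.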