2 Building Quantization from First Principles
This chapter covers
- Fixed-point number representation
- Affine quantization mapping
- Scale and zero-point parameters
- Quantization error analysis
Every number inside a neural network—every weight, every activation, every gradient—lives on the real number line. Training happens in floating-point, where that line stretches as far as the mathematics requires. Deployment is a different story. The devices that run inference at scale—mobile phones, edge accelerators, data-center INT8 cores—speak a cruder language: small integers packed into 8-bit or 4-bit containers. Quantization is the engineering discipline that translates between these two worlds, mapping continuous floating-point values onto a finite grid of discrete integer levels.
The previous chapter established why this translation matters: smaller types mean less memory, faster arithmetic, and lower energy per inference. This chapter builds the translation itself. We start from the hardware up—how a fixed-point machine actually stores and manipulates numbers—and work toward the affine mapping that modern frameworks use to compress neural network tensors into integers. Along the way, we derive the scale and zero-point parameters, analyze the two kinds of error quantization introduces (rounding and clipping), and confront the engineering trade-off between symmetric and asymmetric schemes.
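To make the destination concrete before the derivation, here is a minimal sketch of the affine mapping, assuming signed 8-bit integers in [-128, 127] and written in NumPy. The helper names (`affine_params`, `quantize`, `dequantize`) are illustrative, not any framework's API; the chapter derives each formula properly in the sections that follow.

```python
import numpy as np

def affine_params(x_min, x_max, qmin=-128, qmax=127):
    """Choose scale and zero-point so [x_min, x_max] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Real -> integer: scale, round to the nearest level, shift, clip."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Integer -> real: undo the shift and the scale; rounding error remains."""
    return scale * (q.astype(np.float32) - zero_point)

# Round-trip a small tensor; the residual is the quantization error.
x = np.array([-0.62, 0.0, 0.35, 1.27], dtype=np.float32)
scale, zp = affine_params(float(x.min()), float(x.max()))
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
print(np.abs(x - x_hat).max())  # for in-range values, error <= scale / 2
```

Note that because the zero-point shifts the integer grid, the floating-point range need not be symmetric about zero; the symmetric scheme discussed later in the chapter is the special case where the zero-point is pinned to 0.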