chapter two

2 Building Quantization from First Principles

This chapter covers

Fixed-point number representation
Affine quantization mapping
Scale and zero-point parameters
Quantization error analysis

Every number inside a neural network (every weight, every activation, every gradient) lives on the real number line. Training happens in floating point, where that line stretches, for practical purposes, as far as the mathematics requires. Deployment is a different story. The devices that run inference at scale, such as mobile phones, edge accelerators, and data-center INT8 cores, speak a cruder language: small integers packed into 8-bit or 4-bit containers. Quantization is the engineering discipline that translates between these two worlds, mapping continuous floating-point values onto a finite grid of discrete integer levels.

The previous chapter established why this translation matters: smaller types mean less memory, faster arithmetic, and lower energy per inference. This chapter builds the translation itself. We start from the hardware up, with how a fixed-point machine actually stores and manipulates numbers, and work toward the affine mapping that modern frameworks use to compress neural network tensors into integers. Along the way, we derive the scale and zero-point parameters, analyze the two kinds of error that quantization introduces (granular and overload), and confront the engineering trade-off between symmetric and asymmetric schemes.
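As a preview of where the chapter is headed, the mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the formulation the later sections derive: the function names (`affine_params`, `quantize`, `dequantize`) are hypothetical, and the sketch assumes an unsigned 8-bit target range, round-to-nearest, and a real range widened to include zero.

```python
# Illustrative sketch only: all names here are hypothetical, not a framework API.

def affine_params(x_min, x_max, q_min=0, q_max=255):
    """Pick a scale and zero point that map [x_min, x_max] onto [q_min, q_max]."""
    # Assumption: widen the range to include 0.0 so real zero lands on an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:  # degenerate case: every value is zero
        scale = 1.0
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Map a real value to the nearest integer level, clamping to the grid."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Map an integer level back to its real-valued grid point."""
    return scale * (q - zero_point)

values = [-1.0, -0.25, 0.0, 0.4, 2.0]
scale, zero_point = affine_params(min(values), max(values))
codes = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in codes]

for v, q, r in zip(values, codes, recovered):
    print(f"{v:+.4f} -> {q:3d} -> {r:+.4f}  (error {abs(v - r):.5f})")
```

In this sketch, values inside the calibrated range come back with an error of at most half a step (`scale / 2`), while values outside it are clamped to the nearest end of the grid. These are the two error sources the chapter examines under the names granular and overload error.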

2.1 The fixed-point world

2.2 Define the affine mapping with scale and zero point

2.3 Characterize granular and overload error

2.4 Compare symmetric and asymmetric choices

2.5 End-to-end INT8 inference in PyTorch

2.6 Summary