chapter two

2 Building Quantization from First Principles

 

This chapter covers

  • Fixed-point number representation
  • Affine quantization mapping
  • Scale and zero-point parameters
  • Quantization error analysis

Every number inside a neural network (every weight, every activation, every gradient) lives on the real number line. Training happens in floating point, where that line stretches as far as the mathematics requires. Deployment is a different story. The devices that run inference at scale, including mobile phones, edge accelerators, and data-center INT8 cores, speak a cruder language: small integers packed into 8-bit or 4-bit containers. Quantization is the engineering discipline that translates between these two worlds, mapping continuous floating-point values onto a finite grid of discrete integer levels.

We previously established why this translation matters: smaller data types mean less memory, faster arithmetic, and lower energy per inference.

Now we build the translation itself. We start from the hardware up, looking at how a fixed-point machine actually stores and manipulates numbers, and work toward the affine mapping that modern frameworks use to compress neural network tensors into integers. Along the way, we derive the scale and zero-point parameters, analyze the two kinds of error that quantization introduces (granular and overload), and confront the engineering tradeoff between symmetric and asymmetric schemes. We close with end-to-end INT8 inference in PyTorch.

2.1 The fixed-point world

2.2 Define the affine mapping with scale and zero point

2.3 Characterize granular and overload errors

2.4 Compare symmetric and asymmetric choices

2.5 End-to-end INT8 inference in PyTorch

2.6 Summary