chapter nine

9 Quantization

This chapter covers

How FP32, INT8, and INT4 formats trade memory for accuracy
Implementing quantize/dequantize steps and reasoning about error
Choosing granularity schemes that balance hardware cost and fidelity
Calibrating scales and zero-points with static or dynamic approaches
Navigating symmetric versus asymmetric choices alongside AWQ refinements
Stitching these ideas into an end-to-end model quantization workflow

Picture yourself trying to run a modern language model on a single GPU. The model weights alone consume 700GB of memory, but your GPU has only 80GB. This isn’t a hypothetical problem; it’s the reality facing anyone trying to deploy large models in production. Yet somehow, companies are running these massive models on consumer hardware. How? The answer lies in a technique so powerful it feels like cheating: quantization.

Here’s what makes quantization genuinely magical: when you compress a model from 16-bit to 4-bit precision, you’re not just getting a modest performance bump. The weights need 4x less memory to load and 4x less bandwidth to move around, and inference can run up to twice as fast, depending on how memory-bound the workload is. Your GPU bill can drop by as much as half. This isn’t optimization at the margins; it’s the difference between "impossible on this hardware" and "runs smoothly in production."
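To make the memory arithmetic concrete, here is a minimal sketch that computes the weight footprint at several bit widths. The 350-billion-parameter count is an assumption chosen only because it yields the 700GB figure at 16-bit; the helper name `weight_memory_gb` is hypothetical, and the calculation ignores scales, zero-points, activations, and any other runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed model size (hypothetical): 350e9 params * 2 bytes = 700 GB at 16-bit.
NUM_PARAMS = 350e9

# Footprint of the weights at each precision, keyed by bits per weight.
footprints = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:7.1f} GB")

# Ratio between 16-bit and 4-bit storage.
reduction = footprints[16] / footprints[4]
print(f"16-bit to 4-bit reduction: {reduction:.0f}x")
```

The ratio is the part that generalizes: whatever the parameter count, going from 16 bits to 4 bits per weight divides weight storage by four. Real quantized checkpoints come out slightly larger than this estimate because they also store per-group scales and, for some schemes, zero-points.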

9.1 Quantization building blocks

9.1.1 Data types: the foundation of quantization

9.1.2 Basic quant/dequant operations

9.2 Deployment strategies

9.2.1 Granularity schemes: how to apply quantization

9.2.2 Calibration: computing quantization parameters

9.2.3 Static versus dynamic quantization

9.2.4 Symmetric versus asymmetric quantization

9.3 Advanced techniques

9.3.1 AWQ: activation-aware weight quantization

9.3.2 Simple example: big versus small inputs

9.4 Putting it all together

9.4.1 Real model quantization workflow

9.4.2 Accuracy validation strategy

9.4.3 Common pitfalls

9.5 Where we can go further

9.5.1 GPTQ

9.5.2 NF4: NormalFloat4

9.5.3 Quantization-aware training (QAT)