
4 Applying Post-Training Quantization and Calibration


This chapter covers

  • Deciding when post-training quantization is sufficient for your deployment
  • Building production-faithful calibration sets
  • Comparing range estimation algorithms and their trade-offs
  • Validating accuracy and size with a repeatable protocol

Chapter 3 established what to quantize—weights, activations, KV cache—and at what granularity—per-tensor, per-channel, or group-wise. You left with a decision checklist that tells you which tensors deserve attention and how finely to slice the scale factors.

But knowing what to quantize is not the same as knowing how to set those scale factors in practice. The scale factor S and zero-point Z that define your quantization grid aren't arbitrary constants; they're empirical quantities derived from data. Get them wrong, and your carefully chosen granularity scheme produces garbage. Get them right, and a model quantized in minutes can match the original model's accuracy.
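To make S and Z concrete, here is a minimal sketch of asymmetric (affine) INT8 quantization. The function names and the choice of an unsigned [0, 255] grid are illustrative, not a reference implementation; the point is that S and Z fall straight out of an observed float range [x_min, x_max].

```python
import numpy as np

def compute_scale_zero_point(x_min, x_max, num_bits=8):
    """Derive scale S and zero-point Z for asymmetric quantization.

    Maps the observed float range [x_min, x_max] onto the integer
    grid [0, 2**num_bits - 1]. Names are illustrative.
    """
    qmin, qmax = 0, 2**num_bits - 1
    # Widen the range to include zero so that 0.0 is exactly representable,
    # a common requirement (e.g., for zero-padding).
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    # Real value x maps to integer q = round(x / S) + Z, clipped to the grid.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2**num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: x_hat = S * (q - Z).
    return scale * (q.astype(np.float32) - zero_point)
```

For a tensor whose calibration run observed values in [-1.0, 2.0], a quantize-then-dequantize round trip reconstructs each value to within half a quantization step (scale / 2), which is exactly the error budget that good calibration is trying to minimize.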

This is the domain of calibration: the art of observing your model on representative data to determine the optimal quantization parameters. And the good news is that for most practitioners, calibration is enough. You don't need to retrain. You don't need gradients. You need good data, the right range estimation algorithm, and a validation protocol that catches problems before they hit production.

4.1 Know when post-training quantization is enough

4.2 Build a calibration set that actually represents production

4.3 Estimate ranges: Absolute max, percentile, and MSE

4.4 Validate accuracy and size with a repeatable protocol

4.5 Summary