chapter four

4 Applying Post-Training Quantization and Calibration

 

This chapter covers

  • Deciding when post-training quantization is sufficient for your deployment
  • Building production-faithful calibration sets
  • Choosing a range estimation algorithm and weighing its trade-offs
  • Validating accuracy and size with a repeatable protocol

Chapter 3 established what to quantize—weights, activations, KV cache—and at what granularity—per-tensor, per-channel, or group-wise. You left with a decision checklist that tells you which tensors deserve attention and how finely to slice the scale factors.

But knowing what to quantize is not the same as knowing what data to derive your mapping from. Weight ranges can be computed from the checkpoint alone, but for activations the scale factor S and zero-point Z that define your quantization grid are determined by the values the model actually produces at inference time, and those values depend on the input data as much as on the weights. Derive them from calibration data that doesn't match production, and your carefully chosen granularity scheme produces garbage in deployment. Derive them from data that does, and a model quantized in minutes can often come close to the original accuracy.
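To make the dependence on calibration data concrete, here is a minimal sketch in plain Python. It assumes unsigned 8-bit asymmetric quantization with min/max range estimation, and it stands in for real activations with synthetic Gaussian samples. The helper names (`affine_params`, `fake_quantize`, `mse`) and the distributions are illustrative assumptions, not part of any library.

```python
import random

def affine_params(values, num_bits=8):
    """Derive scale S and zero-point Z from the observed min/max (assumed scheme)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero stays exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point

def fake_quantize(values, scale, zero_point, num_bits=8):
    """Quantize to the integer grid, clamp, and map back to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    out = []
    for x in values:
        q = max(qmin, min(qmax, round(x / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
# Synthetic stand-ins for activations (hypothetical distributions).
production = [rng.gauss(0.0, 4.0) for _ in range(5000)]
matched_calib = [rng.gauss(0.0, 4.0) for _ in range(5000)]   # resembles production
narrow_calib = [rng.gauss(0.0, 0.5) for _ in range(5000)]    # much narrower range

s_good, z_good = affine_params(matched_calib)
s_bad, z_bad = affine_params(narrow_calib)

err_good = mse(production, fake_quantize(production, s_good, z_good))
err_bad = mse(production, fake_quantize(production, s_bad, z_bad))
print(f"matched calibration:    S={s_good:.4f} Z={z_good} MSE={err_good:.5f}")
print(f"mismatched calibration: S={s_bad:.4f} Z={z_bad} MSE={err_bad:.5f}")
```

The model and the bit width are identical in both runs. Only the calibration data changes, yet the grid derived from the narrow sample clips most production values and the reconstruction error grows by orders of magnitude.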

This is the domain of calibration: running your model on representative data and recording the statistics you need to choose good quantization parameters. And the good news is that for most practitioners, calibration is enough. You don't need to retrain. You don't need gradients. You need good data, the right range estimation algorithm, and a validation protocol that catches problems before they hit production.
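As a rough picture of what "no retraining, no gradients" means in practice, the sketch below runs forward passes only and records a running min/max per batch. `MinMaxObserver` and `toy_layer` are hypothetical names invented for this illustration; `toy_layer` is a stand-in for a real layer's forward pass, and min/max is just one possible range estimator.

```python
class MinMaxObserver:
    """Tracks the running min and max of every activation it is shown."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")
        self.batches_seen = 0

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))
        self.batches_seen += 1

    def quant_params(self, num_bits=8):
        """Turn the observed range into a scale and zero-point (unsigned grid)."""
        if self.batches_seen == 0:
            raise ValueError("observe at least one calibration batch first")
        qmin, qmax = 0, 2 ** num_bits - 1
        lo = min(self.lo, 0.0)  # keep zero exactly representable
        hi = max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a zero range
        zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
        return scale, zero_point


def toy_layer(inputs):
    # Hypothetical stand-in for a layer: an affine map followed by ReLU.
    return [max(0.0, 1.5 * x - 0.25) for x in inputs]


# Hypothetical calibration batches; real ones would be sampled from production-like data.
calibration_batches = [[0.0, 0.5, 1.0], [2.0, -1.0, 3.0], [0.1, 4.0, 0.2]]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(toy_layer(batch))  # forward pass only, no backward pass

scale, zero_point = observer.quant_params()
print(f"observed range [{observer.lo}, {observer.hi}] -> S={scale:.5f}, Z={zero_point}")
```

Nothing in the loop updates the model. The only state that changes is the observer, which is why calibration of this kind typically finishes in minutes rather than the hours a training run would take.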

4.1 Know when post-training quantization is enough

4.2 Build a calibration set that actually represents production

4.3 Estimate ranges: absolute max, percentile, and MSE

4.4 Validate accuracy and size with a repeatable protocol

4.5 Summary