
3 Choosing What to Quantize and at What Granularity

 

This chapter covers

  • Weight, activation, and KV cache quantization targets
  • Per-tensor, per-channel, and group-wise granularity schemes
  • Memory layout and kernel efficiency trade-offs
  • Decision frameworks for precision allocation

A neural network is not a single tensor. It contains weights, which are frozen after training; activations, which flow through the network during inference; and, in transformers, a KV cache that grows with sequence length. These three targets have radically different statistical properties and sensitivities to error. Weights are static and roughly bell-shaped, so they tolerate rounding well. Activations shift with every input and develop extreme outliers in transformer hidden states. The KV cache sits somewhere in between: it is written once per token but read thousands of times. Quantizing all three identically, say with symmetric INT8 and a single scale per tensor, wastes precision on weights that could tolerate INT4, fails to capture the activation outliers that need special handling, and ignores the structural asymmetry between keys and values in the cache. Quantizing none of them is equally wasteful: a 70B-parameter model in FP16 occupies 140 GB, most of it dead weight during memory-bound autoregressive generation. The skill lies in knowing where each bit of precision pays for itself.
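To make the footprint argument concrete, the back-of-the-envelope sketch below estimates how many gigabytes the weights and the KV cache occupy at 16-, 8-, and 4-bit precision. It is only an illustration: the layer count, number of key/value heads, head dimension, and context length are assumed values for a hypothetical 70B-parameter decoder, not figures taken from any particular model.

# Rough memory estimate for two quantization targets of a hypothetical
# 70B-parameter decoder. The architectural numbers are assumptions chosen
# only to make the arithmetic concrete, not measurements of a real model.
PARAMS   = 70e9    # total weight count
LAYERS   = 80      # assumed number of transformer blocks
KV_HEADS = 8       # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128     # assumed per-head dimension
SEQ_LEN  = 8192    # assumed context length
BATCH    = 1

def weight_gb(bits):
    """Bytes needed to store every weight at the given bit width, in GB."""
    return PARAMS * bits / 8 / 1e9

def kv_cache_gb(bits):
    """Keys and values for every layer and token at the given bit width, in GB."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bits / 8   # K and V
    return BATCH * SEQ_LEN * per_token / 1e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: weights ~ {weight_gb(bits):6.1f} GB, "
          f"KV cache ~ {kv_cache_gb(bits):5.2f} GB")

Under these assumptions the 16-bit row reproduces the 140 GB figure above, while the KV cache adds only a few gigabytes at a batch size of 1 but grows linearly with both batch size and sequence length. That difference in scaling behavior is one reason weights and the KV cache are treated as separate quantization targets in sections 3.2 and 3.4.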

3.1 Identify where precision loss matters

3.2 Quantize weights with per-tensor and per-channel choices

3.3 Quantize activations under outliers and dynamic ranges

3.4 Quantize the transformer KV cache with mixed precision

3.5 Use group and block schemes and understand memory layout

3.6 Follow a decision checklist for granularity

3.7 Summary