chapter three

3 Choosing What to Quantize and at What Granularity

 

This chapter covers

  • Weight, activation, and KV cache quantization targets
  • Per-tensor, per-channel, and group-wise granularity schemes
  • Memory layout and kernel efficiency trade-offs
  • Decision frameworks for precision allocation

A neural network is not a single tensor. It contains weights (the parameters learned during training and frozen at inference), activations (the intermediate tensors that flow between layers as data passes through, recomputed for every input), and, in transformers, a key-value (KV) cache: the keys K and values V from past tokens, stored so that the attention mechanism doesn't have to recompute them at every generation step.

These three kinds of tensor have radically different statistical properties and sensitivities to quantization error.

Weights are static, and their values follow a roughly bell-shaped distribution, so they tolerate rounding well. Activations shift with every input and, in transformer hidden states, develop extreme outliers. The KV cache sits somewhere in between: each entry is written once per token but read at every later generation step, potentially thousands of times. Quantizing all three identically (say, INT8 symmetric with a single scale per tensor) wastes precision on weights that could tolerate INT4, fails to capture activation outliers that need special handling, and ignores the structural asymmetry between keys and values in the cache. Quantizing none is equally wasteful: a 70B-parameter model in FP16 occupies 140 GB (70 billion parameters at 2 bytes each), and much of that is dead weight during memory-bound autoregressive generation, when the cost of each token is dominated by moving those bytes through memory rather than by arithmetic.
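To make the two claims above concrete, the following minimal Python sketch reproduces the 140 GB figure and shows how one outlier degrades per-tensor symmetric INT8. It is an illustration under stated assumptions, not a production quantizer: the function names are hypothetical, a Gaussian sample stands in for bell-shaped values, the outlier magnitude of 100.0 is arbitrary, and the footprint calculation ignores the storage needed for scales.

```python
import random


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Ignores the extra bytes needed for scales and zero-points.
    """
    return num_params * bits_per_param / 8 / 1e9


def quantize_int8_symmetric(values):
    """Per-tensor symmetric INT8: one scale for the whole tensor, zero-point 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def mean_abs_error(values):
    """Average absolute error after a quantize/dequantize round trip."""
    codes, scale = quantize_int8_symmetric(values)
    restored = [c * scale for c in codes]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


if __name__ == "__main__":
    # Footprint of a 70B-parameter model at three bit widths.
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(70e9, bits)
        print(f"70B parameters at {bits}-bit: {gb:.0f} GB")

    # Assumption: a Gaussian sample stands in for a bell-shaped tensor.
    rng = random.Random(0)
    bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(4096)]
    # Same tensor with one value replaced by an illustrative outlier.
    with_outlier = bell_shaped[:-1] + [100.0]

    print("mean abs error, no outlier: ", mean_abs_error(bell_shaped))
    print("mean abs error, one outlier:", mean_abs_error(with_outlier))
```

The footprint loop prints 140 GB, 70 GB, and 35 GB for 16, 8, and 4 bits. In the second experiment a single large value stretches the shared scale, so every other element is rounded on a much coarser grid. That is the failure mode that motivates the finer granularities and outlier handling discussed in the rest of the chapter.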

3.1 Identify where precision loss matters

3.2 Quantize weights with per-tensor and per-channel choices

3.3 Quantize activations under outliers and dynamic ranges

3.4 Quantize transformer key-value cache with mixed precision

3.5 Use group and block schemes and understand memory layout

3.6 Follow a decision checklist for granularity

3.7 Summary