1 Facing the Efficiency Wall
This chapter covers
- The memory-bandwidth bottleneck
- Why quantization targets the dominant cost
- The floating-point to integer transition
For most of the history of machine learning, efficiency was a secondary concern. Models were small enough to fit comfortably in memory. Inference was fast enough to feel instantaneous. When performance lagged, the usual remedies—better hardware, modest architectural tweaks, or more aggressive batching—were generally sufficient. Accuracy was the main currency, and the cost of getting there was often treated as an operational detail.
That era has ended.
The modern generation of models, especially large transformers, has pushed inference across a qualitative threshold. Parameter counts exploded, context lengths stretched by orders of magnitude, and workloads that once behaved like ordinary applications now behave like infrastructure. Latency plateaus even on powerful GPUs. Utilization looks suspiciously low. Power draw and memory bandwidth, not arithmetic throughput, become the binding constraints.
Quantization is the technique of representing neural network weights and activations using fewer bits—typically moving from 16-bit or 32-bit floating point down to 8-bit or 4-bit integers. By reducing the number of bits that must be stored and moved through the memory hierarchy, quantization directly attacks the dominant cost of modern inference: data movement.
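To make this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization using NumPy. The function names and the 4096 x 4096 "weight matrix" are illustrative, not part of any particular library: a single scale factor maps the largest absolute value onto the int8 range, each float is rounded to the nearest representable integer, and dequantization multiplies back by the scale.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of float32 values to int8.

    The scale maps the largest absolute value onto [-127, 127], so
    q = round(x / scale) and the reconstruction is x_hat = q * scale.
    """
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximation of the original float32 values."""
    return q.astype(np.float32) * scale

# A stand-in weight matrix: 4096 x 4096 float32 values.
w = np.random.randn(4096, 4096).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"float32 storage: {w.nbytes / 2**20:.1f} MiB")  # 64.0 MiB
print(f"int8 storage:    {q.nbytes / 2**20:.1f} MiB")  # 16.0 MiB
print(f"max abs error:   {np.max(np.abs(w - w_hat)):.4f}")
```

The point of the sketch is the storage line: the same weights occupy one quarter of the bytes in int8, which means one quarter of the data that has to travel through the memory hierarchy at inference time, at the price of a small, bounded reconstruction error. Later chapters refine this basic recipe, but the bandwidth arithmetic is already visible here.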