1 Facing the Efficiency Wall
This chapter covers
- The memory-bandwidth bottleneck
- Why quantization targets the dominant cost
- The floating-point to integer transition
For most of the history of machine learning, efficiency was a secondary concern. Models were small enough to fit comfortably in memory. Inference was fast enough to feel instantaneous. When performance lagged, the usual remedies—better hardware, modest architectural tweaks, or more aggressive batching—were generally sufficient. Accuracy was the main currency, and the cost of getting there was often treated as an operational detail.
That era has ended.
The modern generation of models, especially large transformers, has pushed inference across a qualitative threshold. Parameter counts exploded, context lengths stretched by orders of magnitude, and workloads that once behaved like ordinary applications now behave like infrastructure. Latency plateaus even on powerful GPUs. Utilization looks suspiciously low. Power draw and memory bandwidth, not arithmetic throughput, become the binding constraints.
Quantization is the technique of representing neural network weights and activations using fewer bits—typically moving from 16-bit or 32-bit floating point down to 8-bit or 4-bit integers. By reducing the number of bits that must be stored and moved through the memory hierarchy, quantization directly attacks the dominant cost of modern inference: data movement.
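To make this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization using NumPy. The function names and the 4096 x 4096 "weight matrix" are illustrative, not part of any particular library: a single scale factor maps the largest absolute value onto the int8 range, each float is rounded to the nearest representable integer, and dequantization multiplies back by the scale.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of float32 values to int8.

    The scale maps the largest absolute value onto [-127, 127], so
    q = round(x / scale) and the reconstruction is x_hat = q * scale.
    """
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximation of the original float32 values."""
    return q.astype(np.float32) * scale

# A stand-in weight matrix: 4096 x 4096 float32 values.
w = np.random.randn(4096, 4096).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"float32 storage: {w.nbytes / 2**20:.1f} MiB")  # 64.0 MiB
print(f"int8 storage:    {q.nbytes / 2**20:.1f} MiB")  # 16.0 MiB
print(f"max abs error:   {np.max(np.abs(w - w_hat)):.4f}")
```

The point of the sketch is the storage line: the same weights occupy one quarter of the bytes in int8, which means one quarter of the data that has to travel through the memory hierarchy at inference time, at the price of a small, bounded reconstruction error. Later chapters refine this basic recipe, but the bandwidth arithmetic is already visible here.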