chapter one

1 Facing the Efficiency Wall

This chapter covers

  • The memory-bandwidth bottleneck
  • Why quantization targets the dominant cost
  • The floating-point to integer transition

For most of the history of machine learning, efficiency was a secondary concern. Models were small enough to fit comfortably in memory. Inference was fast enough to feel instantaneous. When performance lagged, the usual remedies, such as better hardware, modest architectural tweaks, or more aggressive batching, were generally sufficient. Accuracy was the main currency, and the cost of getting there was often treated as an operational detail.

That era has ended.

The modern generation of models, especially large transformers, has pushed inference across a qualitative threshold. Parameter counts have exploded, context lengths have stretched by orders of magnitude, and workloads that once behaved like ordinary applications now behave like infrastructure. Latency stops improving even on powerful GPUs, and compute utilization looks suspiciously low, because power draw and memory bandwidth, not arithmetic throughput, have become the binding constraints.
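A rough calculation makes the bandwidth claim concrete. The sketch below compares the time needed to stream a model's weights from memory with the time needed to do the arithmetic for one generated token. Every number in it is an assumption chosen for round figures (a hypothetical 7-billion-parameter model, 1 TB/s of memory bandwidth, 100 TFLOP/s of peak compute, batch size 1, about 2 FLOPs per parameter per token), not a measurement of any particular model or GPU.

```python
# Back-of-envelope estimate: is single-stream token generation limited by
# memory bandwidth or by arithmetic? All values are illustrative assumptions.

PARAMS = 7e9                 # assumed parameter count (hypothetical model)
BYTES_PER_PARAM_FP16 = 2     # 16-bit floating-point weights
BYTES_PER_PARAM_INT8 = 1     # 8-bit integer weights
MEM_BANDWIDTH = 1e12         # assumed memory bandwidth, bytes per second
PEAK_COMPUTE = 100e12        # assumed peak arithmetic throughput, FLOP/s


def per_token_times(bytes_per_param):
    """Return (memory_seconds, compute_seconds) for generating one token.

    Assumes batch size 1, every weight read once per token, and roughly
    2 FLOPs (one multiply, one add) per parameter per token.
    """
    memory_s = PARAMS * bytes_per_param / MEM_BANDWIDTH
    compute_s = 2 * PARAMS / PEAK_COMPUTE
    return memory_s, compute_s


fp16_mem, fp16_compute = per_token_times(BYTES_PER_PARAM_FP16)
int8_mem, int8_compute = per_token_times(BYTES_PER_PARAM_INT8)

print(f"FP16: memory {fp16_mem * 1e3:.2f} ms, compute {fp16_compute * 1e3:.2f} ms")
print(f"INT8: memory {int8_mem * 1e3:.2f} ms, compute {int8_compute * 1e3:.2f} ms")
```

Under these assumptions, moving the weights takes about a hundred times longer than the arithmetic, so a faster arithmetic unit would change almost nothing. Halving the bytes per weight halves the dominant term, which is the sense in which quantization targets the dominant cost.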

1.1 The cost crisis in memory, latency, and power

1.2 Why quantization is the practical response

1.3 Mapping floating point vs. integer

1.3.1 Floating point's hidden costs

1.3.2 What integers give up, and what they gain

1.4 Summary