chapter eight
8 Pushing into Low Bits Safely
This chapter covers
- Sub-8-bit number formats: low-bit floats (FP8, FP4), distribution-matched types (NF4), and ternary
- FP8 E4M3/E5M2 encoding, scaling strategies, and production deployment on H100
- FP4 with blockwise scaling and its hardware dependencies
- QLoRA fine-tuning with NF4 on consumer GPUs
- Ternary models: why post-training conversion fails and where trained-from-scratch fits
- A reproducible Pareto frontier anchoring every claim in one comparison
We’ve closed out the integer side of the quantization story. Linear, uniformly spaced grids took us all the way to 4-bit weight-only LLMs with no measurable quality loss; INT8 doubled inference throughput on commodity hardware; GPTQ and AWQ shrank OPT-6.7B to 3.6 GB at 4-bit; and TurboQuant compressed the KV cache so long-context serving stayed within memory limits.
For most production deployments, those methods are where the work ends.