chapter eight

8 Pushing into Low Bits Safely

This chapter covers

  • Number formats at 8 bits and below: low-bit floats (FP8, FP4), distribution-matched types (NF4), and ternary
  • FP8 E4M3/E5M2 encoding, scaling strategies, and production deployment on H100
  • FP4 with blockwise scaling and its hardware dependencies
  • QLoRA fine-tuning with NF4 on consumer GPUs
  • Ternary models: why post-training conversion fails and where trained-from-scratch fits
  • A reproducible Pareto frontier that anchors every claim in a single comparison

We’ve closed out the integer side of the quantization story. Linear, uniformly spaced grids took us all the way to 4-bit weight-only LLMs with no measurable quality loss. INT8 doubled inference throughput on commodity hardware, GPTQ and AWQ shrank OPT-6.7B to 3.6 GB at 4-bit, and TurboQuant compressed the KV cache so that long-context serving stayed within memory limits.

For most production deployments, those methods are where the work ends.

8.1 Decide when to go below eight bits

8.2 Use FP8 formats and understand kernel caveats

8.2.1 Value distribution: why non-uniform spacing matters

8.2.2 Scaling is mandatory, and granularity matters

8.2.3 Weight reconstruction: FP8 vs. INT8

8.2.4 FP8 deployment: 2× compression, zero quality loss

8.3 Use FP4 with blockwise scaling without losing stability

8.3.1 Seven values and a scale factor

8.3.2 Blockwise scaling: from optional to mandatory

8.3.3 Cross-format comparison: FP4 block-16 versus INT4 group-128

8.3.4 FP4 deployment: the quality cliff and how block scaling recovers

8.4 Fine-tune with low-bit adapters

8.5 Understand ternary quantization and where it fits

8.5.1 Post-training ternary quantization fails

8.5.2 BitLinear: making ternary trainable

8.5.3 What you can run today

8.6 Summary