Welcome
Thank you for purchasing the MEAP for Quantization and Fast Inference. To get the most out of this book, you'll want to be comfortable with Python and PyTorch, and have built and trained a few neural networks. Some exposure to GPU execution will help, but you don't need to be a kernel author. The book is written for ML engineers, infrastructure engineers, and applied researchers with roughly two to six years of experience. If you've shipped a model to a real system and felt the weight of its latency, memory footprint, or serving cost, you're in the right place.
Quantization has become one of the load-bearing techniques of modern AI. A 70B-parameter model in FP16 needs roughly 140 GB of memory for its weights alone; quantized to 4 bits, the same model shrinks to about 35 GB, small enough to serve from a single high-end GPU and run fast enough to hold a conversation. The gap between "this model is interesting" and "this model is deployable" is, more often than not, a quantization problem. And yet most of the material on the subject is scattered across arXiv papers, framework-specific tutorials, and blog posts that hand-wave the math and the numerical gotchas that actually bite you in production.
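To make the memory arithmetic concrete, here is a back-of-the-envelope sketch. It counts only the weights; the function name and the 70B example are illustrative, and real deployments add overhead for activations, the KV cache, and quantization metadata such as scales and zero-points.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in gigabytes.

    Ignores activation memory, the KV cache, and quantization
    metadata (scales, zero-points), all of which add real overhead
    in practice.
    """
    return num_params * bits_per_param / 8 / 1e9


params = 70e9  # a 70B-parameter model
fp16_gb = weight_memory_gb(params, 16)  # 140.0 GB
int4_gb = weight_memory_gb(params, 4)   # 35.0 GB

print(f"FP16: {fp16_gb:.0f} GB  |  4-bit: {int4_gb:.0f} GB")
```

The 4x reduction is exactly the ratio of bits per parameter, which is why dropping from 16 bits to 4 is the difference between a multi-GPU serving cluster and a single card.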