
5 Multi-token prediction and FP8 quantization


This chapter covers

  • Using multi-token prediction (MTP) to provide stronger training signals
  • Implementing a causal MTP architecture
  • Applying FP8 quantization to improve training efficiency

We have now established the core architectural pillars of the DeepSeek model: Multi-Head Latent Attention and Mixture-of-Experts. These innovations define what the model computes. In this chapter, we turn to an equally important question: how those computations are performed so efficiently. The answer lies in two techniques central to DeepSeek's training methodology: Multi-Token Prediction (MTP) and FP8 quantization. While FP8 quantization was already being adopted in the industry to accelerate inference, DeepSeek's key innovation was demonstrating its successful and stable application to the much more demanding task of large-scale training.

This chapter is divided into two main parts. First, we will dive deep into MTP: its motivation, its advantages, and exactly how DeepSeek implemented its causal version of the technique. You will learn not just the theory but also how to build a functional MTP module, seeing firsthand how predicting a horizon of future tokens strengthens the model's planning capabilities. In the second part, we will take a detailed look at the FP8 quantization framework that allows these massive models to be trained with remarkable speed and memory efficiency.
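
To make the idea concrete before we begin, here is a minimal PyTorch sketch of the targets an MTP objective trains against. The helper name build_mtp_targets and its depth parameter are illustrative choices for this preview, not part of DeepSeek's code: for each prediction depth k, the target at position t is simply the token k steps ahead, so k = 1 recovers ordinary next-token prediction.

import torch

def build_mtp_targets(token_ids: torch.Tensor, depth: int) -> list[torch.Tensor]:
    """Return the target sequence for each prediction depth k = 1..depth.

    The target at position t for depth k is the token at position t + k;
    ordinary next-token prediction is the special case k = 1.
    """
    targets = []
    for k in range(1, depth + 1):
        # The last k positions have no token k steps ahead, so each deeper
        # target sequence is one token shorter than the previous one.
        targets.append(token_ids[..., k:])
    return targets

tokens = torch.tensor([10, 11, 12, 13, 14])
for k, tgt in enumerate(build_mtp_targets(tokens, depth=2), start=1):
    print(f"depth {k} targets: {tgt.tolist()}")
# depth 1 targets: [11, 12, 13, 14]
# depth 2 targets: [12, 13, 14]

How these deeper predictions are chained together sequentially, and how the full causal MTP module is implemented, is covered in sections 5.3 and 5.4.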

5.1 The core idea: From single-token to multi-token prediction

5.2 The four key advantages of MTP

5.2.1 Densification of training signals

5.2.2 Improved data efficiency

5.2.3 Better planning by prioritizing "choice points"

5.2.4 Higher inference speed via speculative decoding

5.3 The DeepSeek MTP architecture: A visual and mathematical walkthrough

5.3.1 The starting point: The shared transformer trunk

5.3.2 The MTP modules: A sequential chain of prediction

5.3.3 The final loss calculation

5.4 Implementing a causal multi-token prediction module from scratch

5.5 Quantization: Trading precision for speed and memory

5.5.1 What is quantization?

5.5.2 Why quantize? The memory cost of high-precision parameters

5.5.3 Understanding numerical formats: The building blocks of quantization

5.5.4 The basic mechanism: Scaling

5.5.5 The five pillars of DeepSeek's FP8 training

5.5.6 Pillar 1: The mixed precision framework

5.5.7 Pillar 2: Fine-grained quantization

5.5.8 Pillar 3: Increasing accumulation precision

5.5.9 Pillar 4: Mantissa over exponents

5.5.10 Pillar 5: Online quantization

5.6 Summary