5 Multi-token prediction and FP8 quantization
This chapter covers
- Strengthening training signals with Multi-Token Prediction
- Implementing a causal MTP architecture
- Applying FP8 quantization to speed up training and reduce memory use
We have now established the core architectural pillars of the DeepSeek model: Multi-Head Latent Attention and Mixture-of-Experts. Those innovations define what the model computes. In this chapter, we turn to an equally important question: how the model is trained, both effectively and efficiently. Two techniques are central to DeepSeek's training methodology: Multi-Token Prediction (MTP), which strengthens the training signal by having the model predict several future tokens at once, and FP8 quantization, which lowers the precision of computations to save time and memory. While FP8 quantization was already being adopted in the industry to accelerate inference, DeepSeek's key innovation was demonstrating that it could be applied stably to the much more demanding task of large-scale training.
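To make "quantization" concrete before the deep dive, here is a minimal sketch of a per-tensor FP8 round trip. It assumes PyTorch 2.1 or later, which provides the torch.float8_e4m3fn dtype; this toy scheme is illustrative only and is far simpler than the fine-grained framework we build later in the chapter.

```python
import torch

# Round-trip a tensor through FP8 (E4M3): pick a per-tensor scale so the
# largest magnitude lands at the E4M3 maximum (448), cast down, cast back.
# This is a toy per-tensor scheme, not DeepSeek's training recipe.
x = torch.randn(4, 8)
scale = x.abs().max() / 448.0

x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # quantize: 8 bits per value
x_restored = x_fp8.to(torch.float32) * scale  # dequantize for comparison

print("max error:", (x - x_restored).abs().max().item())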
This chapter is divided into two parts. In the first, we dig into MTP: its motivation, its advantages, and exactly how DeepSeek implemented its advanced causal variant. You will learn not just the theory but also how to build a functional MTP module, seeing firsthand how predicting a horizon of future tokens strengthens the model's planning capabilities. In the second, we take a deep dive into the FP8 quantization framework that lets these massive models train with remarkable speed and memory efficiency.
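As a preview of the first part, the sketch below shows the basic shape of a multi-token prediction head, assuming PyTorch. The class name SimpleMTPHead and its horizon parameter are our own illustrative choices: this toy version predicts each future offset with an independent head, whereas DeepSeek's causal design, which we develop step by step, chains the predictions together.

```python
import torch
import torch.nn as nn

class SimpleMTPHead(nn.Module):
    """Toy multi-token prediction head (illustrative names only).

    One linear head per future offset: head k predicts token t+k+1
    from the hidden state at position t. Unlike DeepSeek's causal MTP,
    the heads here make their predictions independently.
    """

    def __init__(self, hidden_dim: int, vocab_size: int, horizon: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(horizon)
        )

    def forward(self, hidden_states: torch.Tensor) -> list[torch.Tensor]:
        # hidden_states: (batch, seq_len, hidden_dim)
        # returns `horizon` logit tensors, each (batch, seq_len, vocab_size)
        return [head(hidden_states) for head in self.heads]

# Smoke test on random activations.
mtp = SimpleMTPHead(hidden_dim=64, vocab_size=1000, horizon=2)
hidden = torch.randn(2, 16, 64)
for k, logits in enumerate(mtp(hidden), start=1):
    print(f"offset +{k}: {tuple(logits.shape)}")  # (2, 16, 1000)
```

Each head contributes its own cross-entropy loss against the token at its offset, which is what gives MTP its denser training signal compared with predicting only the single next token.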