chapter six

6 The DeepSeek training pipeline: Building a foundation model

This chapter covers

  • Assembling the DeepSeek V3 architecture
  • Building a complete data and training pipeline
  • Scaling efficiently with DualPipe Parallelism

We’ve deconstructed DeepSeek’s core innovations: Multi-Head Latent Attention, Decoupled RoPE, the DeepSeek Mixture-of-Experts layer, and the Multi-Token Prediction training objective. We have explored the theory and even built some of these components in isolation. Now it's time to put it all together, moving from individual components to a complete, functional system.

We will integrate all these advanced concepts into a single PyTorch model, a "mini-DeepSeek V3" that you can train from scratch on your own hardware. This hands-on process is the final step in truly understanding how these theoretical pieces interact in a real-world training environment. Our journey covers the entire pipeline, from raw data to a working, text-generating model.

Figure 6.1 Our four-stage journey to build the DeepSeek model. This chapter is dedicated to part 2 of Stage 3, where we take the core architecture from Stage 2, combine it with advanced training techniques like MTP and FP8, and build a complete foundation model. We will also introduce the final training innovation, DualPipe Parallelism.

6.1 The data foundation: Preparing the TinyStories dataset

6.1.1 Choosing the right tools: The TinyStories dataset and TikToken

6.1.2 Setting up the environment

6.1.3 The prepare.py script: A step-by-step walkthrough

6.2 Assembling the Mini-DeepSeek model

6.2.1 Building the components: RoPE, MLA, MoE, and MTP

6.2.2 The final architecture: MiniDeepSeek

6.3 The training pipeline: Bringing the model to life

6.3.1 Configuration and system setup

6.3.2 Loading data and scheduling the learning rate

6.3.3 The main training script

6.4 The engine of scale: Understanding DualPipe Parallelism

6.4.1 The standard training loop and its memory limit

6.4.2 Gradient accumulation in code

6.4.3 Data parallelism: The "more GPUs, less time" approach

6.4.4 Method 1: nn.DataParallel (DP)

6.4.5 Method 2: nn.parallel.DistributedDataParallel (DDP)

6.4.6 Practical implementation of DDP

6.4.7 DDP + gradient accumulation

6.5 Model parallelism

6.5.1 Key terminology for model parallelism

6.5.2 Pipeline parallelism: The assembly line approach

6.5.3 The naive schedule (GPipe) and the pipeline bubble

6.5.4 The 1F1B schedule: Eliminating the bubble

6.5.5 The DeepSeek innovation: DualPipe Parallelism

6.5.6 The DualPipe schedule: Hiding MoE communication