
3 The DeepSeek breakthrough: Multi-Head Latent Attention (MLA)


This chapter covers

  • Compressing the KV Cache with Multi-Head Latent Attention (MLA)
  • Injecting positional awareness with Rotary Positional Encoding (RoPE)
  • Fusing MLA and RoPE with a decoupled architecture

In the last chapter, we completed Stage 1 of our journey by building a solid foundation in efficient LLM inference. We began with the problem of repeated calculations during autoregressive decoding, which we solved with the KV Cache. However, we then saw the dark side of the KV Cache: its massive memory cost. We explored the first-generation solutions, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which reduce memory usage but introduce a painful trade-off: they sacrifice the expressive power of Multi-Head Attention (MHA). This left us with an unresolved tension between performance and efficiency.
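
To make that memory pressure concrete before we meet MLA, the following back-of-the-envelope sketch (an illustrative example, not code from this book's repository) estimates the KV Cache footprint for MHA versus GQA. The layer count, head count, head dimension, and sequence length are assumed values for a typical 7B-class model, not DeepSeek's actual configuration.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # One K and one V vector per token, per layer, per KV head,
    # stored in 16-bit precision (2 bytes per value) by default.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed 7B-class shape: 32 layers, 128-dim heads, 32k-token context.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32_000)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_000)

print(f"MHA KV Cache: {mha / 1e9:.1f} GB")  # ~16.8 GB for a single sequence
print(f"GQA KV Cache: {gqa / 1e9:.1f} GB")  # ~4.2 GB, but fewer distinct K/V heads

GQA shrinks the cache only by sharing K/V heads across groups of query heads, which is exactly the expressiveness trade-off that Multi-Head Latent Attention, the subject of this chapter, is designed to avoid.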

Figure 3.1 Our four-stage journey to build the DeepSeek model. Having completed the Key-Value Cache Foundation (Stage 1), we now begin Stage 2. This chapter focuses on the highlighted component, Multi-Head Latent Attention (MLA) & Decoupled RoPE, the first major innovation in the core architecture.

3.1 MLA: The best of both worlds

3.2 The MLA architecture: A visual walkthrough

3.2.1 The query path (unchanged)

3.2.2 The key/value path (the innovation)

3.3 The mathematical magic: How the latent matrix helps

3.3.1 A step-by-step derivation of Q, K, and V in MLA

3.3.2 The absorption trick: How attention scores are calculated

3.3.3 The final step: Calculating the context vector

3.4 The new inference loop with MLA

3.4.1 What happens when a new token arrives?

3.4.2 Caching the latent vector: The only thing we store

3.4.3 Decompressing the cache and calculating attention

3.5 Quantifying the gains

3.5.1 The new KV cache formula: A 64x reduction

3.5.2 Preserving performance: Why head diversity is maintained

3.6 Building an MLA module from scratch

3.7 The problem of order

3.8 Attempt #1: The naive approach - integer positional encodings

3.8.1 The simple idea: Using position numbers directly

3.8.2 The major flaw: Polluting semantic embeddings with large magnitudes

3.9 Attempt #2: A step forward - binary positional encodings

3.9.1 Solving the magnitude problem with binary representation

3.9.2 Uncovering a deeper pattern: Oscillation frequencies

3.9.3 The new problem: The issue with discontinuous jumps

3.10 Attempt #3: The "Attention Is All You Need" breakthrough - sinusoidal positional encodings

3.10.1 From discrete jumps to smooth waves: Introducing sine and cosine

3.10.2 The power of rotation: Encoding relative positions