3 Coding Attention Mechanisms


This chapter covers

  • Exploring the reasons for using attention mechanisms in neural networks
  • Introducing a basic self-attention framework and progressing to an enhanced self-attention mechanism
  • Implementing a causal attention module that allows LLMs to generate one token at a time
  • Masking randomly selected attention weights with dropout to reduce overfitting
  • Stacking multiple causal attention modules into a multi-head attention module

In the previous chapter, you learned how to prepare the input text for training LLMs. This involved splitting text into individual word and subword tokens, which can be encoded into vector representations, the so-called embeddings, for the LLM.

In this chapter, we look at an integral part of the LLM architecture itself: attention mechanisms, as illustrated in Figure 3.1.

Figure 3.1 A mental model of the three main stages: coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an integral part of an LLM architecture.

Attention mechanisms are a comprehensive topic, which is why we are devoting a whole chapter to them. We will largely look at these attention mechanisms in isolation and focus on them at a mechanistic level. In the next chapter, we will code the remaining parts of the LLM surrounding the self-attention mechanism to see it in action and to create a model that generates text.

3.1 The problem with modeling long sequences

3.2 Capturing data dependencies with attention mechanisms

3.3 Attending to different parts of the input with self-attention

3.3.1 A simple self-attention mechanism without trainable weights

3.3.2 Computing attention weights for all input tokens

3.4 Implementing self-attention with trainable weights

3.4.1 Computing the attention weights step by step

3.4.2 Implementing a compact self-attention Python class

3.5 Hiding future words with causal attention

3.5.1 Applying a causal attention mask

3.5.2 Masking additional attention weights with dropout

3.5.3 Implementing a compact causal attention class

3.6 Extending single-head attention to multi-head attention

3.6.1 Stacking multiple single-head attention layers

3.6.2 Implementing multi-head attention with weight splits

3.7 Summary
