Multi-Head Attention

Overview

Multi-head attention extends the traditional attention mechanism by running multiple attention heads in parallel. Each head independently computes its own attention scores and context vectors, which allows the model to capture different aspects of the input data and recognize more complex patterns within it. The outputs from all the attention heads are concatenated to form the final context vectors, providing a richer representation of the input.
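As a rough illustration of that concatenation step, the snippet below (with arbitrarily chosen sizes, not taken from the text) shows how two heads' context vectors combine along the feature dimension:

```python
import torch

torch.manual_seed(123)
num_tokens, d_head = 6, 2              # example sizes for illustration only

# Pretend each head has already produced its own context vectors.
z1 = torch.rand(num_tokens, d_head)    # output of head 1
z2 = torch.rand(num_tokens, d_head)    # output of head 2

# The final context vectors concatenate the per-head outputs
# along the feature dimension: (6, 2) and (6, 2) -> (6, 4).
z = torch.cat([z1, z2], dim=-1)
print(z.shape)                         # torch.Size([6, 4])
```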

Definition

In multi-head attention, multiple instances of causal attention are employed, allowing the model to focus on different parts of the input sequence simultaneously. This parallel processing is crucial for capturing intricate dependencies and relationships within the data, which is particularly beneficial for sequential data, as in natural language processing.
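A minimal PyTorch sketch of this idea appears below; the class names and the simplified single-head causal attention implementation are illustrative assumptions, not the book's exact code:

```python
import torch
import torch.nn as nn

class CausalSelfAttentionHead(nn.Module):
    """A single causal attention head (simplified, illustrative)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                       # x: (num_tokens, d_in)
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = queries @ keys.T / keys.shape[-1] ** 0.5
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return weights @ values                 # context vectors: (num_tokens, d_out)

class MultiHeadAttentionWrapper(nn.Module):
    """Runs several heads side by side and concatenates their outputs."""
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalSelfAttentionHead(d_in, d_out) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(123)
x = torch.rand(6, 3)                            # 6 tokens, 3-dimensional embeddings
mha = MultiHeadAttentionWrapper(d_in=3, d_out=2, num_heads=2)
print(mha(x).shape)                             # torch.Size([6, 4])
```

Each head keeps its own query, key, and value projections, so the heads can learn to attend to different relationships; the wrapper only coordinates them and joins their outputs.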

Multi-Head Attention Module

The multi-head attention module can be visualized as multiple single-head attention modules stacked on top of each other. Instead of using a single value weight matrix Wv, a multi-head attention module with two heads, for example, uses two separate value weight matrices, Wv1 and Wv2. The same applies to the query and key weight matrices, Wq and Wk. This configuration yields two sets of context vectors, Z1 and Z2, which are then combined into a single context vector matrix Z.

[Figure 3.24](https://livebook.manning.com/build-a-large-language-model-from-scratch/chapter-3/figure--3-24) The multi-head attention module includes two single-head attention modules stacked on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head attention module with two heads, we now have two value weight matrices: Wv1 and Wv2. The same applies to the other weight matrices, WQ and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine into a single context vector matrix Z.
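To make the two-head case concrete, the sketch below uses explicitly separate query, key, and value weight matrices per head and concatenates the resulting Z1 and Z2. The variable names are illustrative, and the causal mask is omitted here to keep the focus on the separate weight matrices:

```python
import torch

torch.manual_seed(123)
num_tokens, d_in, d_out = 6, 3, 2
x = torch.rand(num_tokens, d_in)               # example input embeddings

# One set of query/key/value weight matrices per head.
W_q1, W_k1, W_v1 = (torch.rand(d_in, d_out) for _ in range(3))   # head 1
W_q2, W_k2, W_v2 = (torch.rand(d_in, d_out) for _ in range(3))   # head 2

def head(x, W_q, W_k, W_v):
    queries, keys, values = x @ W_q, x @ W_k, x @ W_v
    weights = torch.softmax(queries @ keys.T / d_out ** 0.5, dim=-1)
    return weights @ values                    # one head's context vectors

Z1 = head(x, W_q1, W_k1, W_v1)                 # shape (6, 2)
Z2 = head(x, W_q2, W_k2, W_v2)                 # shape (6, 2)
Z = torch.cat([Z1, Z2], dim=-1)                # combined context matrix: (6, 4)
print(Z.shape)                                 # torch.Size([6, 4])
```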
