Layer Normalization

Layer normalization is a crucial technique in the training of neural networks, particularly in transformer architectures like GPT-2. It stabilizes the hidden state dynamics by normalizing the summed inputs to the neurons within a hidden layer. This normalization process significantly reduces training time and improves the stability and efficiency of the model.

Overview

The primary goal of layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. In modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module, and before the final output layer.

[Figure 4.5](https://livebook.manning.com/build-a-large-language-model-from-scratch/chapter-4/figure--4-5) An illustration of layer normalization where the six outputs of the layer, also called activations, are normalized such that they have a 0 mean and a variance of 1.
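The arithmetic behind this normalization can be sketched in plain Python. The six activation values below are made up for illustration and are not taken from the figure; the sketch subtracts the mean and divides by the square root of the variance (computed by dividing by n), then checks the statistics of the result.

```python
import math

# Six hypothetical activations from one layer (illustrative values only)
activations = [0.22, 0.34, 0.00, 0.22, 0.00, 0.00]

n = len(activations)
mean = sum(activations) / n
# Biased (population) variance: divide by n, not n - 1
var = sum((a - mean) ** 2 for a in activations) / n

# Normalize: subtract the mean, divide by the standard deviation
normalized = [(a - mean) / math.sqrt(var) for a in activations]

# Recompute the statistics on the normalized values
norm_mean = sum(normalized) / n
norm_var = sum((z - norm_mean) ** 2 for z in normalized) / n

print(f"before: mean={mean:.4f}, variance={var:.4f}")
print(f"after:  mean={norm_mean:.4f}, variance={norm_var:.4f}")
```

The small constant eps used in a practical implementation is left out here so that the resulting variance equals 1 up to floating-point error.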

Implementation

Layer normalization can be implemented in a neural network using a PyTorch module. Below is a class definition for layer normalization:

Listing 4.2 A layer normalization class

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # small constant to avoid division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable shift

    def forward(self, x):
        # Statistics are computed over the last (embedding) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This implementation operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant added to the variance to prevent division by zero during normalization. The scale and shift are trainable parameters that the model adjusts during training to improve performance.
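As a sanity check, the class can be compared against PyTorch's built-in torch.nn.LayerNorm, which also uses the biased variance and a default eps of 1e-5. The sketch below repeats the class so that it runs on its own; the input tensor and its shape are assumptions chosen for illustration. It also feeds in a constant row, whose variance is zero, to show the role of eps.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    # Same logic as Listing 4.2, repeated so this snippet is self-contained
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(0)
x = torch.randn(2, 5)  # hypothetical input: 2 samples, 5 features each

custom_ln = LayerNorm(emb_dim=5)
builtin_ln = nn.LayerNorm(5, eps=1e-5)

# Both modules start with scale = 1 and shift = 0, so outputs should agree
matches_builtin = torch.allclose(custom_ln(x), builtin_ln(x), atol=1e-5)
print("Matches nn.LayerNorm:", matches_builtin)

# A constant row has zero variance; eps keeps the division finite
constant_row = torch.full((1, 5), 3.0)
constant_out = custom_ln(constant_row)
print("Output for constant row:", constant_out)
```

Without eps, the constant row would produce a division of zero by zero and the output would contain NaN values.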

Practical Application

To apply the LayerNorm module to a batch input, you can use the following code:

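# batch_example is not defined in this section; a random batch of
# 2 samples with 5 features each is assumed here
torch.manual_seed(123)
batch_example = torch.randn(2, 5)

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)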

The output of this code shows that layer normalization works as expected: each input row is normalized to a mean of approximately 0 and a variance of approximately 1, up to floating-point error and the small effect of eps.

Biased Variance

In layer normalization, the variance is calculated with unbiased=False, meaning the sum of squared deviations is divided by the number of inputs n, without applying Bessel’s correction (which divides by n - 1 instead). This yields a biased estimate of the variance, but the difference is negligible for large embedding dimensions, and the choice ensures compatibility with the GPT-2 model’s normalization layers.
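To see how small the difference becomes, the following sketch compares the two variance estimates on random values using Python's statistics module. The sample values are randomly generated for illustration; 768 is the embedding size of the smallest GPT-2 model. Because the two estimates differ only in the divisor, the relative gap works out to 1/n.

```python
import random
import statistics

random.seed(0)

gaps = {}
for n in (5, 768):
    values = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased = statistics.pvariance(values)   # divides by n
    unbiased = statistics.variance(values)  # divides by n - 1 (Bessel's correction)
    # Relative gap between the two estimates; algebraically equal to 1 / n
    gaps[n] = (unbiased - biased) / unbiased
    print(f"n={n}: biased={biased:.4f}, unbiased={unbiased:.4f}, "
          f"relative gap={gaps[n]:.4%}")
```

For n = 5 the two estimates differ by a fifth of the unbiased value, while for n = 768 the gap shrinks to roughly a tenth of a percent.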