Layer Normalization
Layer normalization is a widely used technique for training neural networks, particularly transformer architectures such as GPT-2. It stabilizes the hidden state dynamics by normalizing the summed inputs to the neurons within a hidden layer, which tends to shorten training time and make training more stable.
Overview
The primary goal of layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. In modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module, and before the final output layer.
Figure 4.5 An illustration of layer normalization where the six outputs of the layer, also called activations, are normalized such that they have a 0 mean and a variance of 1.
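The normalization described above can be sketched by hand for a single row of six activations. The values below are hypothetical and are not taken from the figure; the sketch only illustrates subtracting the mean and dividing by the standard deviation.

```python
import torch

# Hypothetical activations for one input: six outputs of a layer
activations = torch.tensor([[0.5, 1.5, 0.0, 2.0, 0.25, 1.0]])

# Statistics are computed across the six outputs of this one input
mean = activations.mean(dim=-1, keepdim=True)
var = activations.var(dim=-1, keepdim=True, unbiased=False)

# Subtract the mean and divide by the standard deviation
normalized = (activations - mean) / torch.sqrt(var)

print("Mean after normalization:", normalized.mean(dim=-1))
print("Variance after normalization:", normalized.var(dim=-1, unbiased=False))
```

The printed mean should be close to 0 and the printed variance close to 1, up to floating-point rounding.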
Implementation
Layer normalization can be implemented in a neural network using a PyTorch module. Below is a class definition for layer normalization:
Listing 4.2 A layer normalization class
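import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # small constant to avoid division by zero
        # Trainable per-feature scale and shift, initialized to the identity
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Statistics are computed over the last (embedding) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift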
This implementation operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant added to the variance to prevent division by zero during normalization. The scale and shift are trainable parameters that the model adjusts during training to improve performance.
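To see why the small constant matters, consider a row whose features are all identical, so its variance is exactly zero. The following is a minimal sketch with a hypothetical input; it repeats the normalization arithmetic directly rather than using the class.

```python
import torch

# Hypothetical input: every feature is identical, so the variance is 0
x = torch.full((1, 5), 2.0)

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

# Without eps the division is 0 / 0, which yields NaN
without_eps = (x - mean) / torch.sqrt(var)

# With eps the denominator stays positive and the result is 0
with_eps = (x - mean) / torch.sqrt(var + 1e-5)

print("Without eps:", without_eps)
print("With eps:   ", with_eps)
```

Without the constant, a single degenerate row like this would propagate NaN values through the rest of the network.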
Practical Application
To apply the LayerNorm module to a batch input, you can use the following code:
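# batch_example is assumed to be a tensor defined earlier whose last
# dimension has size 5, matching emb_dim=5
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
# Check the statistics of each normalized row
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)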
The results from this code show that layer normalization works as expected: the values of each input are normalized to a mean of approximately 0 and a variance of approximately 1. The values are not exactly 0 and 1 because of eps and floating-point rounding.
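One further sanity check, not part of the original listing, is to compare the same arithmetic against PyTorch's built-in torch.nn.functional.layer_norm. The sketch below uses a hypothetical random batch and repeats the normalization formula inline so that it runs on its own; with scale at 1 and shift at 0, the two results should agree closely.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5)  # hypothetical batch: 2 inputs, 5 features each

# Same arithmetic as the forward pass above, with scale=1 and shift=0
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# Built-in functional layer norm over the last dimension, no affine parameters
builtin = F.layer_norm(x, normalized_shape=(5,), eps=1e-5)

print("Max absolute difference:", (manual - builtin).abs().max().item())
```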
Biased Variance
In this implementation, the variance is calculated with unbiased=False, meaning it divides by the number of inputs n without applying Bessel’s correction, which would divide by n - 1 instead. This results in a biased estimate of the variance, but the difference is negligible for large embedding dimensions, and the setting ensures compatibility with the GPT-2 model’s normalization layers.
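The size of the difference can be checked numerically. The sketch below uses a hypothetical helper, variance_ratio, and two illustrative sizes (5 and 768, chosen here only as a small and a large example). The ratio of the unbiased to the biased estimate is n / (n - 1), which approaches 1 as n grows.

```python
import torch

def variance_ratio(n):
    """Return unbiased variance divided by biased variance for n random values."""
    torch.manual_seed(0)
    x = torch.randn(n)
    biased = x.var(unbiased=False)    # divides by n
    unbiased = x.var(unbiased=True)   # divides by n - 1 (Bessel's correction)
    return (unbiased / biased).item()

for n in (5, 768):
    print(f"n={n}: ratio={variance_ratio(n):.4f}, n/(n-1)={n / (n - 1):.4f}")
```

For n = 5 the two estimates differ by 25 percent, while for n = 768 they differ by roughly 0.13 percent.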
Layer Normalization vs. Batch Normalization
Layer normalization differs from batch normalization in that it normalizes across the features for each individual training case, rather than across the batch. This makes it particularly suitable for recurrent neural networks and transformer models where batch sizes can vary or be small.
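The difference in normalization axis can be illustrated with a simplified sketch. The helper normalize is hypothetical, the input is random, and the batch-style variant shown here omits the running statistics and trainable parameters that a full batch normalization layer would have.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 5)  # hypothetical batch: 4 training cases, 5 features

def normalize(t, dim):
    # Normalize t to zero mean and unit variance along the given dimension
    mean = t.mean(dim=dim, keepdim=True)
    var = t.var(dim=dim, keepdim=True, unbiased=False)
    return (t - mean) / torch.sqrt(var + 1e-5)

layer_style = normalize(x, dim=-1)  # per training case, across features
batch_style = normalize(x, dim=0)   # per feature, across the batch

# Layer-style output for one case does not depend on the rest of the batch
first_alone = normalize(x[:1], dim=-1)

print("Row means (layer-style):   ", layer_style.mean(dim=-1))
print("Column means (batch-style):", batch_style.mean(dim=0))
print("Independent of batch:", torch.allclose(first_alone, layer_style[:1]))
```

Because each row is normalized using only its own statistics, the layer-style result is the same whether the batch holds one case or many.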