GELU (Gaussian Error Linear Unit)
Overview
The Gaussian Error Linear Unit (GELU) is an activation function used in neural networks, particularly in large language models (LLMs). It is a smooth, nonlinear function that approximates the Rectified Linear Unit (ReLU) but with a non-zero gradient for almost all negative values. This property allows GELU to enable more nuanced adjustments to the model’s parameters during training, contributing to better optimization properties.
Definition
GELU can be thought of as combining the gating behavior of the classic ReLU activation function with the standard normal distribution’s cumulative distribution function: instead of a hard cutoff at zero, each input is weighted by the probability that a standard normal variable falls below it. This gives GELU a smooth nonlinearity with a stochastic-regularization interpretation, both of which are beneficial in deep learning models.
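For reference, the exact definition of GELU and the tanh-based approximation implemented in Listing 4.3 below are

GELU(x) = x · Φ(x),

where Φ(x) is the cumulative distribution function of the standard normal distribution, and

GELU(x) ≈ 0.5 · x · (1 + tanh[ √(2/π) · (x + 0.044715 · x³) ]),

which avoids evaluating the Gaussian CDF directly while staying very close to the exact function.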
Implementation
In code, GELU can be implemented as a PyTorch module. The following listing implements the tanh-based approximation of GELU:
Listing 4.3 An implementation of the GELU activation function
import torch
import torch.nn as nn

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Tanh-based approximation of GELU(x) = x * Phi(x)
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
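As a quick sanity check (a minimal sketch, not part of the listing), the implementation above can be compared against PyTorch’s built-in nn.GELU, which in recent PyTorch versions offers the same tanh approximation via its approximate argument:

import torch
import torch.nn as nn

gelu = GELU()                              # the class from Listing 4.3
torch_gelu = nn.GELU(approximate="tanh")   # PyTorch's own tanh approximation

x = torch.linspace(-3, 3, 7)
print(gelu(x))
# Expected to print True (up to floating-point tolerance)
print(torch.allclose(gelu(x), torch_gelu(x), atol=1e-6))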
Comparison with ReLU
To understand how GELU compares to ReLU, we can plot these functions side by side. The plot in Figure 4.8 illustrates the differences between the two activation functions.
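The following is a minimal sketch of how such a side-by-side plot can be produced, assuming the GELU class from Listing 4.3 is defined and matplotlib is installed:

import torch
import matplotlib.pyplot as plt

gelu, relu = GELU(), torch.nn.ReLU()

x = torch.linspace(-3, 3, 100)             # sample inputs in the range [-3, 3]
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)                   # one panel per activation function
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()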
Figure 4.8 The output of the GELU and ReLU plots using matplotlib. The x-axis shows the function inputs and the y-axis shows the function outputs.
As shown in the plot, ReLU (right) is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. In contrast, GELU (left) is a smooth, nonlinear function that approximates ReLU but has a non-zero gradient for almost all negative values (the gradient is zero only near x = –0.75, where GELU reaches its minimum).

The smoothness of GELU can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model’s parameters. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows a small, non-zero output for negative values. This means that during training, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than those receiving positive inputs.
Application in Neural Networks
GELU is often used in the construction of neural network modules. For example, it can be used in a small neural network module, FeedForward, which is part of a transformer block in LLMs. Below is an example of how GELU is integrated into a feed-forward neural network module:
Listing 4.4 A feed forward neural network module
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),   # expand to 4x the embedding dimension
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),   # project back to the embedding dimension
        )

    def forward(self, x):
        return self.layers(x)
In this FeedForward module, GELU is used between two Linear layers. In the 124-million-parameter GPT model, the module receives input batches of tokens with an embedding size of 768 each, as specified by the GPT_CONFIG_124M dictionary, where GPT_CONFIG_124M["emb_dim"] = 768. This setup demonstrates how GELU is typically embedded in the feed-forward blocks of deep learning architectures.
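As a brief usage illustration (a minimal sketch; only the emb_dim entry of the configuration is shown, whereas the real GPT_CONFIG_124M dictionary contains additional settings), the module can be instantiated and applied to a random batch to confirm that the input and output shapes match:

import torch

# Minimal stand-in for GPT_CONFIG_124M, reduced to the only field FeedForward uses
cfg = {"emb_dim": 768}

ffn = FeedForward(cfg)

x = torch.rand(2, 3, 768)       # 2 samples, 3 tokens each, 768-dimensional embeddings
out = ffn(x)
print(out.shape)                # torch.Size([2, 3, 768]) -- the shape is preserved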