GPT Model

The GPT (Generative Pre-trained Transformer) model is a type of large language model (LLM) that is designed for text generation tasks. It operates by predicting the next token in a sequence based on a given input context. The model is structured with token and positional embeddings, followed by a series of transformer blocks, and concludes with a layer normalization and a linear output layer. The output layer maps the transformer’s output to a high-dimensional space corresponding to the vocabulary size, enabling the prediction of the next token.

Architecture Overview

GPT models consist of many repeated transformer blocks and can have millions to billions of parameters. The GPT-2 family, for example, comes in four sizes with 124, 345, 762, and 1,542 million parameters. Despite these differences in size, all of them can be implemented with the same GPTModel Python class. The architecture of the GPT model is depicted in Figure 4.15.

[Figure 4.15](https://livebook.manning.com/build-a-large-language-model-from-scratch/chapter-4/figure--4-15) An overview of the GPT model architecture showing the flow of data through the GPT model. Starting from the bottom, tokenized text is first converted into token embeddings, which are then augmented with positional embeddings. This combined information forms a tensor that is passed through a series of transformer blocks shown in the center (each containing multi-head attention and feed forward neural network layers with dropout and layer normalization), which are stacked on top of each other and repeated 12 times.
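To make the first part of this flow concrete (token IDs becoming a batch of embedding vectors), here is a minimal, self-contained sketch. It uses the embedding sizes of the 124-million-parameter variant and a made-up batch of random token IDs; it is an illustration of the shapes, not part of the chapter's listings:

import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768   # sizes of the 124M variant
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

token_ids = torch.randint(0, vocab_size, (2, 4))   # made-up batch: 2 sequences, 4 tokens each
x = tok_emb(token_ids) + pos_emb(torch.arange(4))  # positional embeddings broadcast over the batch
print(x.shape)                                     # torch.Size([2, 4, 768])

The resulting tensor of shape [batch_size, seq_len, emb_dim] is exactly what the stack of transformer blocks consumes and produces.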

The model begins with tokenized text that is converted into token embeddings. These embeddings are then augmented with positional embeddings to form a tensor. This tensor is passed through a series of transformer blocks, each containing multi-head attention and feed-forward neural network layers with dropout and layer normalization. In the 124-million-parameter GPT-2 model, the transformer block is repeated 12 times, while in the largest GPT-2 model with 1,542 million parameters, it is repeated 48 times.
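The different model sizes are thus not different architectures but different configuration values. For the 124-million-parameter variant, the configuration looks roughly like this (a sketch consistent with the numbers quoted in this section; the chapter defines the exact GPT_CONFIG_124M dictionary earlier):

GPT_CONFIG_124M = {
    "vocab_size": 50257,      # vocabulary size of the GPT-2 BPE tokenizer
    "context_length": 1024,   # maximum number of input tokens
    "emb_dim": 768,           # embedding dimension
    "n_heads": 12,            # attention heads per transformer block
    "n_layers": 12,           # number of transformer blocks
    "drop_rate": 0.1,         # dropout rate
    "qkv_bias": False         # bias terms in the query/key/value projections
}

Scaling up to the larger GPT-2 models mainly means increasing emb_dim, n_heads, and n_layers (for example, 48 layers in the 1,542-million-parameter model) while keeping the same code.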

The output from the final transformer block undergoes a final layer normalization step before reaching the linear output layer. This layer maps the transformer's output to a high-dimensional space, which, in the case of the GPT-2 model, corresponds to 50,257 dimensions, matching the model's vocabulary size.
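Because the output layer produces one score per vocabulary entry at every position, the prediction for the next token can be read off the final position of each sequence. A minimal sketch with random logits (greedy selection via argmax, shown here only for illustration):

import torch

logits = torch.randn(2, 4, 50257)                  # hypothetical logits: 2 sequences, 4 positions, GPT-2 vocabulary

next_token_ids = logits[:, -1, :].argmax(dim=-1)   # greedy pick at the last position of each sequence
print(next_token_ids.shape)                        # torch.Size([2])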

Implementation

The implementation of the GPT model architecture is compact, thanks to the TransformerBlock and LayerNorm classes implemented earlier in the chapter. Below is the code for the GPTModel class:

Listing 4.7 The GPT model architecture implementation

import torch
import torch.nn as nn

# TransformerBlock and LayerNorm are the classes implemented earlier in this chapter.

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])      # token embedding layer
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])  # positional embedding layer
        self.drop_emb = nn.Dropout(cfg["drop_rate"])                        # dropout on the embedded input

        # Stack of n_layers identical transformer blocks
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])   # final layer normalization
        self.out_head = nn.Linear(                    # projects to vocabulary-sized logits
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)             # [batch_size, seq_len, emb_dim]
        pos_embeds = self.pos_emb(                    # [seq_len, emb_dim], broadcast over the batch
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)                     # [batch_size, seq_len, vocab_size]
        return logits

This class initializes the token and positional embeddings, applies dropout, and processes the input through a series of transformer blocks. The final output is normalized and passed through a linear layer to produce the logits for the next token prediction.
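The usage example that follows feeds a batch of token IDs into the model. The chapter prepares its batch (and the GPT_CONFIG_124M dictionary) earlier; purely as a rough sketch of how such a batch could be built, assuming the tiktoken GPT-2 tokenizer and two hypothetical example texts:

import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")
texts = ["Every effort moves you", "Every day holds a"]   # hypothetical example texts
batch = torch.stack([torch.tensor(tokenizer.encode(txt)) for txt in texts])
print(batch.shape)   # torch.Size([2, 4]); both texts happen to encode to four tokens each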

To initialize a 124-million-parameter GPT model, the following code can be used:

torch.manual_seed(123)               # for reproducible weight initialization
model = GPTModel(GPT_CONFIG_124M)    # configuration dictionary for the 124M model (defined earlier)

out = model(batch)                   # batch is a tensor of token IDs (see the sketch above)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

This code sets a random seed for reproducibility, initializes the 124-million-parameter model from the GPT_CONFIG_124M configuration dictionary, and passes a batch of token IDs through it to produce the output logits tensor.
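Assuming a batch of two sequences with four tokens each, as in the sketch above, the output tensor has shape [2, 4, 50257]: one row of vocabulary scores per input token. Continuing from the model created above, the total parameter count can also be inspected as a quick sanity check:

# Total parameter count of the instantiated model. The raw count comes out above
# 124 million because the token embedding (vocab_size x emb_dim) and the output
# head (emb_dim x vocab_size) are stored as separate matrices in this implementation,
# whereas the original GPT-2 ties these weights.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")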
