GPT (Generative Pre-trained Transformer)
Overview
GPT, or Generative Pre-trained Transformer, is a model architecture designed for generating new text. It is based on the decoder module of the transformer architecture. The primary task of a GPT model is to predict the next token in a sequence: input text is passed through a series of components, and for each position the model produces a vector of logits whose dimension equals the vocabulary size.
Architecture
The architecture of GPT models is built upon the transformer architecture, which includes several key components:
Token and Positional Embeddings: The input text is first tokenized and then converted into token embeddings. These embeddings are augmented with positional embeddings to incorporate the order of tokens in the sequence, as illustrated in the short sketch after this list.
Transformer Blocks: The combined token and positional embeddings form a tensor that is passed through multiple transformer blocks. Each block contains multi-head attention and feed-forward neural network layers, along with dropout and layer normalization. These blocks are stacked in sequence (e.g., 12 of them in some configurations).
Final Output Layer: After processing through the transformer blocks, the output is normalized and passed through a final linear layer to produce logits corresponding to the vocabulary size.
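To make the embedding step concrete, the following minimal sketch shows how token embeddings and positional embeddings are combined before entering the transformer blocks. The sizes (a vocabulary of 50,257 tokens, a context length of 1,024, a 768-dimensional embedding) and the token IDs are illustrative assumptions in the style of a small GPT-2 configuration:

import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768   # assumed GPT-2-small-style sizes
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

batch = torch.tensor([[6109, 3626, 6100],    # arbitrary token IDs for one sequence
                      [6109, 1110, 6622]])   # arbitrary token IDs for a second sequence
seq_len = batch.shape[1]

tok_embeds = tok_emb(batch)                    # (2, 3, 768)
pos_embeds = pos_emb(torch.arange(seq_len))    # (3, 768), broadcast over the batch
x = tok_embeds + pos_embeds                    # (2, 3, 768): the input to the transformer blocks
print(x.shape)                                 # torch.Size([2, 3, 768])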
The following figure provides a visual overview of the GPT model architecture:
Figure 4.15 An overview of the GPT model architecture showing the flow of data through the GPT model.
Implementation
The GPT model can be implemented as a class in a deep learning framework such as PyTorch. Below is an example implementation of the GPT model architecture:
Listing 4.7 The GPT model architecture implementation
import torch
import torch.nn as nn

# TransformerBlock and LayerNorm are assumed to be defined as in the preceding sections.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                # (batch_size, seq_len, emb_dim)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)  # (seq_len, emb_dim), broadcast over the batch
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)                        # (batch_size, seq_len, vocab_size)
        return logits
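As a quick sanity check, the model above can be instantiated with a configuration dictionary and applied to a batch of token IDs. The dictionary below is an assumed GPT-2-small-style configuration; the n_heads and qkv_bias entries are extra keys that the TransformerBlock from the earlier sections is assumed to read, and the variable names are only illustrative:

import torch

GPT_CONFIG = {                  # assumed GPT-2-small-style settings
    "vocab_size": 50257,        # vocabulary size of the GPT-2 tokenizer
    "context_length": 1024,     # maximum number of input tokens
    "emb_dim": 768,             # embedding dimension
    "n_heads": 12,              # attention heads (read by TransformerBlock)
    "n_layers": 12,             # number of transformer blocks
    "drop_rate": 0.1,           # dropout rate
    "qkv_bias": False,          # bias in the query/key/value projections (read by TransformerBlock)
}

torch.manual_seed(123)
gpt = GPTModel(GPT_CONFIG)
batch = torch.randint(0, GPT_CONFIG["vocab_size"], (2, 4))   # two sequences of four token IDs
logits = gpt(batch)
print(logits.shape)   # torch.Size([2, 4, 50257]): one logit per vocabulary entry, per position

The last dimension of the output equals the vocabulary size, which is what allows the model to score every candidate next token at every position.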
Usage
The term gpt often refers to an instance of the GPTModel class. Such an instance is initialized with a configuration dictionary and can be used to load pretrained GPT-2 weights and evaluate the resulting model. The gpt instance is commonly used for generating text and performing other related tasks.
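As a sketch of how such a gpt instance might be used to generate text, the loop below repeatedly feeds the current token IDs through the model, reads the logits at the last position, and appends the most likely next token (greedy decoding). The function name, the placeholder token IDs, and the reuse of the gpt and GPT_CONFIG objects from the example above are assumptions for illustration; in practice the IDs would come from a tokenizer and the model would carry pretrained weights:

def generate_greedy(model, idx, max_new_tokens, context_length):
    # idx is a (batch_size, seq_len) tensor of token IDs
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]        # crop to the supported context length
        with torch.no_grad():
            logits = model(idx_cond)               # (batch_size, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick at the last position
        idx = torch.cat([idx, next_id], dim=1)     # append the new token and continue
    return idx

start_ids = torch.tensor([[1, 2, 3]])              # placeholder token IDs; normally from a tokenizer
out = generate_greedy(gpt, start_ids, max_new_tokens=5,
                      context_length=GPT_CONFIG["context_length"])
print(out.shape)   # torch.Size([1, 8]): the 3 input tokens plus 5 newly generated ones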