Appendix A. The Transformer architecture

 

This appendix covers

  • Introduction to neural networks and deep learning
  • The Transformer architecture
  • GPT pre-training
  • Key components of transformers
  • GPT inference process

To understand how Large Language Models (LLMs) work, it's essential to grasp the Transformer architecture. This architecture was introduced in 2017 in the paper "Attention Is All You Need" by Ashish Vaswani and colleagues at Google Brain and Google Research (https://arxiv.org/abs/1706.03762). The paper builds on the principles of attention and encoder-decoder models, and following it requires some foundational knowledge of artificial neural networks, embeddings, and positional encodings.
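To make the central idea concrete before the detailed sections, here is a minimal sketch of scaled dot-product attention, the operation the paper's title refers to, written in plain Python with NumPy. The function name, array shapes, and random inputs are illustrative assumptions for this sketch only, not values taken from any particular model.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]                                # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

# Toy example: a sequence of 4 token vectors of dimension 8 attending to itself.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V = x
print(out.shape)                                     # (4, 8)

Calling the function with Q = K = V, as in the toy example, is the self-attention pattern used inside the Transformer, covered in section A.6.3.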

This appendix covers key concepts to help you understand how LLMs work. Because of space limitations it is not an in-depth exploration of neural networks or other advanced topics; instead, it provides a high-level overview. The goal is to give you the foundation needed to understand the architectural diagram from Chapter 1 (Figure 1.10), reproduced here for convenience as Figure A.1. This diagram is crucial for understanding the structure and functionality of LLMs. For a deeper treatment of the Transformer architecture, see Transformers in Action by Nicole Koenigstein or Build a Large Language Model (From Scratch) by Sebastian Raschka, both published by Manning.

A.1 Neural Network Basics

A.2 Neural Network Training

A.3 From Recurrent Neural Networks to Transformers

A.3.1 Recurrent Neural Networks (RNNs)

A.3.2 The Transformer Solution

A.4 The Transformer

A.5 Pre-Training Fundamentals of Decoder-Only Transformers

A.5.1 Objective of Pre-Training

A.5.2 Scale of Pre-Training

A.5.3 Emerging Properties

A.6 Building Blocks of Transformers

A.6.1 Token Embeddings

A.6.2 Positional Encodings

A.6.3 Self-Attention

A.6.4 Multi-Head Attention

A.6.5 Feed-Forward Networks

A.7 The Inference Process of a GPT-Like Transformer

A.7.1 What Does GPT Mean?

A.7.2 High-Level Overview of GPT Text Generation

A.7.3 Transformer Inference Process in Detail