Appendix A. The Transformer architecture
This appendix covers
- Introduction to neural networks and deep learning
- The transformer architecture
- GPT pre-training
- Key components of transformers
- GPT inference process
To understand how Large Language Models (LLMs) work, it's essential to grasp the Transformer architecture. This architecture was introduced in 2017 in the paper "Attention Is All You Need" by Ashish Vaswani and colleagues from Google Brain and Google Research (https://arxiv.org/abs/1706.03762). The paper builds on the principles of attention and the encoder-decoder concept, and understanding it requires some foundational knowledge of artificial neural networks, embeddings, and positional encodings.
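To make the central idea of attention concrete before the more detailed discussion later in this appendix, the following is a minimal sketch of the scaled dot-product attention formula from the paper, written in plain NumPy. The function name and the toy input are illustrative, not taken from any particular library, and the sketch omits masking, multiple heads, and learned projection matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]                                # dimensionality of the keys
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

# Toy example: 3 tokens with 4-dimensional embeddings (values chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x))         # self-attention: Q = K = V = x
```

In self-attention, each token's representation becomes a weighted mixture of all token representations, with the weights determined by how strongly the tokens relate to one another; this is the mechanism the rest of the architecture is built around.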
This appendix covers key concepts to help you understand how LLMs work. Space limitations prevent an in-depth exploration of neural networks or advanced topics, so it provides a high-level overview instead. The goal is to give you the foundation needed to understand the architectural diagram from Chapter 1 (Figure 1.10), repeated below for convenience (Figure A.1). This diagram is crucial for understanding the structure and functionality of LLMs. For a deeper treatment of the Transformer architecture, see Transformers in Action by Nicole Koenigstein or Build a Large Language Model (From Scratch) by Sebastian Raschka, both published by Manning.