9 A line-by-line implementation of attention and Transformer

This chapter covers

  • The architecture and function of encoders and decoders in Transformers
  • How the attention mechanism uses query, key, and value to assign weights to elements in a sequence
  • Different types of Transformers
  • Building a Transformer from scratch for language translation

Transformers are advanced deep learning models that excel at sequence-to-sequence prediction tasks, outperforming older models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Their strength lies in capturing relationships between elements of the input and output sequences over long distances, such as two words that are far apart in a text. Unlike RNNs, which process tokens one at a time, Transformers can be trained in parallel across all positions in a sequence, significantly cutting training time and making it practical to train on vast datasets. This architecture has been pivotal in the development of large language models (LLMs) such as the GPT models behind ChatGPT, BERT, and T5, marking a significant milestone in AI progress.
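To make the query-key-value idea listed above concrete before the detailed walkthrough in this chapter, here is a minimal sketch of scaled dot-product attention. It assumes PyTorch is available; the tensor names and shapes are illustrative only and are not the chapter's final implementation, which is built step by step in the sections that follow.

import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: tensors of shape (batch, seq_len, d_k)
    d_k = query.size(-1)
    # Compare each query with every key to get raw alignment scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Block positions that should not be attended to (e.g., padding)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Turn scores into weights that sum to 1 across the sequence
    weights = torch.softmax(scores, dim=-1)
    # Each output is a weighted average of the value vectors
    return torch.matmul(weights, value), weights

# Illustrative usage: one sequence of 5 tokens with 8-dimensional vectors
q = k = v = torch.randn(1, 5, 8)
output, attn_weights = scaled_dot_product_attention(q, k, v)
print(output.shape, attn_weights.shape)  # torch.Size([1, 5, 8]) torch.Size([1, 5, 5])

The attention weights form a 5-by-5 matrix for this example: each row shows how much one token attends to every token in the sequence, which is exactly the weighting described in the chapter-covers list.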

9.1 Introduction to attention and Transformer

9.1.1 The attention mechanism

9.1.2 The Transformer architecture

9.1.3 Different types of Transformers

9.2 Building an encoder

9.2.1 The attention mechanism

9.2.2 Creating an encoder

9.3 Building an encoder-decoder Transformer

9.3.1 Creating a decoder layer

9.3.2 Creating an encoder-decoder Transformer

9.4 Putting all the pieces together

9.4.1 Defining a generator

9.4.2 Creating a model to translate between two languages

Summary