8 Attention and the Transformer
This chapter covers:
- Using attention to produce summaries of the input and improve the quality of Seq2Seq models
- Replacing RNN-style loops with self-attention, a mechanism for the input to summarize itself
- Improving machine translation systems with the Transformer model
- Building a high-quality spell checker using the Transformer model and publicly available datasets
Our focus so far in this book has been recurrent neural networks (RNNs), a powerful family of models that can be applied to various NLP tasks such as sentiment analysis, named entity recognition, and machine translation. In this chapter, we will introduce an even more powerful model: the Transformer,[1] a new type of encoder-decoder neural network architecture based on the concept of self-attention. Since it appeared in 2017, it has without a doubt been the most important model in NLP. Not only is it a powerful model in itself (for machine translation and various Seq2Seq tasks, for example), but it also serves as the underlying architecture that powers numerous modern pretrained NLP models, including GPT-2 (section 8.4.3) and BERT (section 9.2). The development of modern NLP since 2017 can best be summarized as "the era of the Transformer."
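To make the idea of self-attention concrete before the chapter develops it in detail, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation the Transformer is built on. The weight matrices `w_q`, `w_k`, `w_v` and the toy dimensions are illustrative assumptions for this sketch, not code from the chapter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q = x @ w_q                                # queries, (seq_len, d_k)
    k = x @ w_k                                # keys,    (seq_len, d_k)
    v = x @ w_v                                # values,  (seq_len, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len): similarity of every token to every other
    weights = softmax(scores, axis=-1)         # each row sums to 1: how strongly each token attends to the others
    return weights @ v                         # (seq_len, d_v): a weighted summary of the whole input per token

# Toy usage: a "sentence" of 5 tokens with 16-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 16): one context-aware vector per input token
```

Notice that there is no loop over time steps: every token's output is computed from all other tokens at once, which is exactly what lets the Transformer dispense with RNN-style recurrence.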