8 Attention and the Transformer
This chapter covers:
- Using attention to produce summaries of the input and improve the quality of Seq2Seq models
- Replacing RNN-style loops with self-attention, a mechanism for the input to summarize itself
- Improving machine translation systems with the Transformer model
- Building a high-quality spell checker using the Transformer model and publicly available datasets
Our focus so far in this book has been recurrent neural networks (RNNs), a powerful family of models that can be applied to various NLP tasks such as sentiment analysis, named entity recognition, and machine translation. In this chapter, we will introduce an even more powerful model: the Transformer,[1] a new type of encoder-decoder neural network architecture based on the concept of self-attention. Since it appeared in 2017, it has without a doubt been the most important model in NLP. Not only is it a powerful model in itself (for machine translation and various Seq2Seq tasks, for example), but it also serves as the underlying architecture that powers numerous modern pretrained NLP models, including GPT-2 (section 8.4.3) and BERT (section 9.2). The development of modern NLP since 2017 can best be summarized as "the era of the Transformer."
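To make the idea of self-attention concrete before the chapter develops it in detail, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation the Transformer is built on. The weight matrices `w_q`, `w_k`, `w_v` and the toy dimensions are illustrative assumptions for this sketch, not code from the chapter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q = x @ w_q                                # queries, (seq_len, d_k)
    k = x @ w_k                                # keys,    (seq_len, d_k)
    v = x @ w_v                                # values,  (seq_len, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len): similarity of every token to every other
    weights = softmax(scores, axis=-1)         # each row sums to 1: how strongly each token attends to the others
    return weights @ v                         # (seq_len, d_v): a weighted summary of the whole input per token

# Toy usage: a "sentence" of 5 tokens with 16-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 16): one context-aware vector per input token
```

Notice that there is no loop over time steps: every token's output is computed from all other tokens at once, which is exactly what lets the Transformer dispense with RNN-style recurrence.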