9 Transformers
This chapter covers
- Showing you the inner workings of Transformers
- Explaining the relation of BERT to the Transformer architecture
- Deriving word embeddings with BERT, using masked language modeling and positional encoding
- Explaining the differences and similarities between BERT and Word2Vec
- Introducing you to XLNet, a competitor of BERT
We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.
Figure 9.1. Transformers encompass a complex, attention-driven process for deriving word embeddings from raw textual data.
In late 2018, researchers from Google published a paper with a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT [Devlin2018]. Like Word2Vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes into account both the left and the right context when learning vector representations for words. Recall that, in contrast, Word2Vec uses just a single piece of context. But this is not all. BERT is grounded in _attention_, and, unlike Word2Vec, it deploys a deep network (recall that Word2Vec essentially uses a shallow network with just one hidden layer).
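To make this contrast concrete, here is a minimal sketch of what "both left and right context" buys you. It assumes the Hugging Face `transformers` package and a PyTorch backend, neither of which is introduced until Chapter 10; the sentences and model checkpoint are illustrative choices, not part of this chapter's examples. The same word, "bank", receives a different vector in each sentence, whereas a static Word2Vec embedding would assign it one and the same vector everywhere.

```python
# Minimal sketch (assumes: pip install torch transformers): contextual BERT
# embeddings versus static Word2Vec-style embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She sat on the river bank.",        # "bank" = riverside
    "She deposited cash at the bank.",   # "bank" = financial institution
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state has shape (batch, tokens, 768):
        # one 768-dimensional vector per token, computed in context.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_vector = outputs.last_hidden_state[0, tokens.index("bank")]
        # The two printed vectors differ, because BERT conditions on the
        # surrounding words; a Word2Vec lookup table would not.
        print(sentence, bank_vector[:3])
```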