This chapter covers:
- The inner workings of Transformers.
- The relation of BERT to the Transformer architecture.
- How BERT derives word embeddings, using masked language modeling and positional encoding.
- The similarities and differences between BERT and word2vec.
- XLNet, a competitor of BERT.
We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.
The following figure shows how the chapter is organized:
Figure 9.1. Chapter organization.
In late 2018, researchers from Google published a paper describing a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT (Devlin2018). Like word2vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes both the left and the right context of a word into account when learning its vector representation. Recall that word2vec, in contrast, uses just a single piece of context. But this is not all. BERT is grounded in attention and, unlike word2vec, deploys a deep network (recall that word2vec essentially uses a shallow network with just one hidden layer).
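To make the idea of left and right context concrete, here is a minimal sketch in plain Python of the input side of masked language modeling: one word is hidden, and everything to its left and to its right remains available for predicting it. This is only an illustration under simplifying assumptions, not BERT itself: there is no network, no subword tokenization, and no training. The function names `mask_token` and `left_right_context` and the example sentence are hypothetical; only the `[MASK]` symbol is taken from BERT.

```python
from typing import List, Tuple

MASK = "[MASK]"  # the mask symbol used by BERT


def mask_token(tokens: List[str], position: int) -> Tuple[List[str], str]:
    """Hide the token at `position`; return the masked copy and the hidden word."""
    masked = list(tokens)          # copy, so the original sentence stays intact
    target = masked[position]      # the word a model would have to predict
    masked[position] = MASK
    return masked, target


def left_right_context(tokens: List[str], position: int) -> Tuple[List[str], List[str]]:
    """Split a sequence into everything left and everything right of `position`."""
    return tokens[:position], tokens[position + 1:]


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the bank raised interest rates again".split()
masked, target = mask_token(tokens, 1)
left, right = left_right_context(masked, 1)

print(masked)  # the sentence with "bank" replaced by [MASK]
print(target)  # the hidden word
print(left)    # context to the left of the mask
print(right)   # context to the right of the mask
```

In this toy sentence, the words to the right of the mask ("raised interest rates") say more about the hidden word than the single word to its left, which is the intuition behind using both sides of the context.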