This chapter covers:
- The inner workings of Transformers.
- How BERT relates to the Transformer architecture.
- How BERT derives word embeddings, using masked language modeling and positional encoding.
- The similarities and differences between BERT and word2vec.
- XLNet, a competitor of BERT.
We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.
The following figure displays the chapter organization:
Figure 9.1. Chapter organization.
Figure 9.2. Transformers encompass a large process for deriving word embeddings from raw textual data.
In late 2018, researchers at Google published a paper on a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2018). Like word2vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes both the left and the right context into account when learning a vector representation for a word. Recall that, in contrast, word2vec takes into account only a single piece of context at a time. But this is not all. BERT is grounded in attention and, unlike word2vec, deploys a deep network (recall that word2vec essentially uses a shallow network with just one hidden layer).
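To give a first impression of what this contextualization buys us, here is a minimal sketch, assuming the Hugging Face transformers library, PyTorch, and the pretrained bert-base-uncased model (none of which are introduced in this chapter; detailed code is deferred to Chapter 10). It shows that BERT assigns the ambiguous word "bank" two different vectors in two different sentences, whereas word2vec would assign it one and the same vector in both.

```python
# A minimal sketch (not this book's own code): contextual BERT vectors versus
# the single static vector that word2vec would assign to "bank".
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She sat down on the river bank.",
    "He deposited the money at the bank.",
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # Locate the token "bank" in the tokenized sentence.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        position = tokens.index("bank")
        # The last hidden state holds one vector per token; the vector for
        # "bank" differs between the two sentences because BERT conditions
        # on both the left and the right context of each occurrence.
        bank_vector = outputs.last_hidden_state[0, position]
        print(sentence, bank_vector[:5])
```

Word2vec, by contrast, stores one fixed vector per word in a lookup table, so both occurrences of "bank" would come out identical.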