9 Transformers
This chapter covers
- Showing you the inner workings of Transformers
- Explaining the relation of BERT to the Transformer architecture
- Deriving word embeddings with BERT, using masked language modeling and positional encoding
- Explaining the differences and similarities between BERT and Word2Vec
- Introducing you to XLNet, a competitor of BERT
We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.
Figure 9.1. Transformers encompass a complex, attention-driven process for deriving word embeddings from raw textual data.
In late 2018, researchers from Google published a paper with a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT [Devlin2018]. Like Word2Vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes into account both the left and the right context when learning vector representations for words. Recall that, in contrast, Word2Vec uses just a single piece of context. But this is not all. BERT is grounded in _attention_, and, unlike Word2Vec, it deploys a deep network (recall that Word2Vec essentially uses a shallow network with just one hidden layer).
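To make this contrast concrete, here is a minimal sketch of what "both left and right context" buys you. It assumes the Hugging Face `transformers` package and a PyTorch backend, neither of which is introduced until Chapter 10; the sentences and model checkpoint are illustrative choices, not part of this chapter's examples. The same word, "bank", receives a different vector in each sentence, whereas a static Word2Vec embedding would assign it one and the same vector everywhere.

```python
# Minimal sketch (assumes: pip install torch transformers): contextual BERT
# embeddings versus static Word2Vec-style embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She sat on the river bank.",        # "bank" = riverside
    "She deposited cash at the bank.",   # "bank" = financial institution
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state has shape (batch, tokens, 768):
        # one 768-dimensional vector per token, computed in context.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_vector = outputs.last_hidden_state[0, tokens.index("bank")]
        # The two printed vectors differ, because BERT conditions on the
        # surrounding words; a Word2Vec lookup table would not.
        print(sentence, bank_vector[:3])
```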