9 Transformers


This chapter covers:

  • The inner workings of Transformers.
  • How BERT relates to the Transformer architecture.
  • How BERT derives word embeddings, using masked language modeling and positional encoding.
  • The similarities and differences between BERT and word2vec.
  • XLNet, a competitor of BERT.

We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.

The following figure displays the chapter organization:

Figure 9.1. Chapter organization.

9.1 Introduction

Figure 9.2. Transformers encompass a large process for deriving word embeddings from raw textual data.

In late 2018, researchers from Google published a paper describing a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2018). Like word2vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes into account both the left and the right context when learning vector representations for words. Recall that, in contrast, word2vec learns only from a small, fixed-size context window and produces a single static vector per word. But this is not all. BERT is grounded in attention and, unlike word2vec, deploys a deep network (recall that word2vec essentially uses a shallow network with just one hidden layer).
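
To make the notion of attention a bit more concrete before we dive in, here is a minimal, illustrative sketch of the scaled dot-product attention that Transformers (and hence BERT) build on. It uses plain NumPy rather than a deep learning framework; the function name and toy dimensions are our own, and detailed, practical code is deferred to Chapter 10.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                     # weighted sum of the values

# Toy example: a "sentence" of 4 token positions, each an 8-dimensional vector.
# In a real Transformer, Q, K and V are learned linear projections of the token
# representations; here we pass the same matrix for all three, for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)   # (4, 8)

Because queries, keys, and values all come from the same sequence (self-attention), every position can attend to every other position, to its left and to its right alike, which is exactly what gives BERT its bidirectional view of context.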

9.1.1 BERT up close: Transformers

9.1.2 Transformer encoders

9.1.3 Transformer decoders

9.2 BERT: Masked language modeling

9.2.1 BERT: training

9.2.2 BERT: fine-tuning

9.2.3 Beyond BERT

9.3 Summary

9.4 Further reading
