9 Transformers

 

This chapter covers

  • Showing you the inner workings of Transformers
  • Explaining the relation of BERT to the Transformer architecture
  • Deriving word embeddings with BERT, using masked language modeling and positional encoding
  • Explaining the differences and similarities between BERT and Word2Vec
  • Introducing you to XLNet, a competitor of BERT

We will go through the technical background of Transformers, and defer applications and detailed code to Chapter 10.

9.1 Introduction

Figure 9.1. Transformers encompass a complex, attention-driven process for deriving word embeddings from raw textual data.

In late 2018, researchers from Google published a paper introducing a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT [Devlin2018]. Like Word2Vec, BERT aims to derive word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes into account both left and right context when learning vector representations for words. Recall that, in contrast, Word2Vec uses just a single piece of context. But this is not all. BERT is grounded in attention, and, unlike Word2Vec, it deploys a deep network (recall that Word2Vec essentially uses a shallow network with just one hidden layer).
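To make the contrast with static embeddings concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint (both are illustrative choices, not requirements of this chapter). It extracts the vector for the word "bank" in two sentences; BERT assigns each occurrence a different, context-dependent vector, whereas Word2Vec would assign the same vector in both cases.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; any pretrained BERT checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She deposited the check at the bank.",   # financial sense
    "They had a picnic on the river bank.",   # geographical sense
]

vectors = []
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # Locate the position of the token "bank" in this sentence.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        position = tokens.index("bank")
        # The final hidden state at that position is the contextual embedding.
        vectors.append(outputs.last_hidden_state[0, position])

# The two "bank" vectors differ, because each reflects its surrounding context.
similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0).item()
print(f"Cosine similarity between the two 'bank' vectors: {similarity:.2f}")

The remainder of this chapter explains the machinery behind such contextual embeddings; hands-on applications follow in chapter 10.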

9.1.1 BERT up close: Transformers

9.1.2 Transformer encoders

9.1.3 Transformer decoders

9.2 BERT: Masked language modeling

9.2.1 Training BERT

9.2.2 Fine-tuning BERT

9.2.3 Beyond BERT

9.3 Summary

9.4 Further reading
