9 Transformers

 

This chapter covers

  • Understanding the inner workings of Transformers
  • Deriving word embeddings with BERT
  • Comparing BERT and Word2Vec
  • Working with XLNet

In late 2018, researchers from Google published a paper introducing a deep learning technique that would soon become a major breakthrough: Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al. 2018). Like Word2Vec, BERT derives word embeddings from raw textual data, but it does so in a much more clever and powerful manner: it takes both the left and the right context into account when learning a vector representation for a word (figure 9.1). In contrast, Word2Vec learns each embedding from a small, order-insensitive window of context words. But this is not the only difference. BERT is grounded in attention and, unlike Word2Vec, deploys a deep network (recall that Word2Vec uses a shallow network with just one hidden layer).

Figure 9.1 Transformers encompass a complex, attention-driven process for deriving word embeddings from raw textual data.

BERT smashed existing performance scores on all tasks it was applied to and (as we see later in this chapter) led to some not-so-trivial insights into deep neural language processing through the analysis of its attention patterns. So, how does BERT do all that? Let’s trace BERT back to its roots: Transformers. We will go through the technical background of Transformers in this chapter and defer applications and detailed code to chapter 10.
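To make the contrast with Word2Vec concrete before we dig into the technical background, the following sketch (not code from this book; it assumes the Hugging Face transformers package, the pretrained bert-base-uncased model, and gensim) shows the practical difference: Word2Vec assigns the word bank a single, fixed vector, whereas BERT produces a different vector for bank in each sentence it appears in.

# A minimal sketch contrasting static Word2Vec vectors with BERT's
# context-dependent embeddings. Assumes `transformers` and `gensim` are installed.
import torch
from transformers import BertTokenizer, BertModel
from gensim.models import Word2Vec

# Word2Vec: one fixed vector per word, regardless of context
sentences = [["the", "bank", "raised", "interest", "rates"],
             ["we", "sat", "on", "the", "river", "bank"]]
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
static_vec = w2v.wv["bank"]                 # identical in both sentences

# BERT: a different vector for "bank" in each sentence
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bert_vector(sentence, word):
    # Return the final hidden state of the word's (first) token in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

vec_finance = bert_vector("the bank raised interest rates", "bank")
vec_river = bert_vector("we sat on the river bank", "bank")

# The two BERT vectors for "bank" differ, reflecting the two senses of the word
sim = torch.cosine_similarity(vec_finance, vec_river, dim=0)
print(f"Word2Vec vector size: {static_vec.shape}, BERT cosine similarity: {sim:.2f}")

Running this prints a cosine similarity noticeably below 1.0 for the two occurrences of bank, which is exactly the contextual behavior that static Word2Vec vectors cannot capture.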

9.1 BERT up close: Transformers

9.2 Transformer encoders

9.2.1 Positional encoding

9.3 Transformer decoders

9.4 BERT: Masked language modeling

9.4.1 Training BERT

9.4.2 Fine-tuning BERT

9.4.3 Beyond BERT

Summary
