5 Deep Transfer Learning for NLP with Transformers
This chapter covers:
- Understanding the basics of the transformer neural network architecture
- Using the generative pretrained transformer (GPT) to generate text
- Using the pretrained bidirectional encoder representations from transformers (BERT) architecture to perform some interesting tasks
- Using the BERT architecture for cross-lingual transfer learning
In this chapter, we will cover some representative deep transfer learning modeling architectures for NLP that rely on a recently popularized neural architecture, the transformer[1], for key functions. This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we will be looking at modeling frameworks such as the generative pretrained transformer (GPT)[2], bidirectional encoder representations from transformers (BERT)[3], and multilingual BERT (mBERT)[4]. These methods employ neural networks with even more parameters than the deep convolutional and recurrent neural network models that we looked at in the previous chapter. Despite their larger size, they have exploded in popularity because they scale more effectively on parallel computing hardware, which enables even larger and more sophisticated models to be developed in practice.
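As a quick preview of the kind of workflow this chapter builds toward, the sketch below generates text with a pretrained GPT-style model. It is a minimal sketch assuming the Hugging Face transformers library and the publicly available gpt2 checkpoint; the prompt and generation parameters are illustrative choices, not the chapter's exact code.

```python
# Minimal sketch: text generation with a pretrained GPT-style model,
# assuming the Hugging Face `transformers` library is installed and the
# public "gpt2" checkpoint is used (illustrative; not the chapter's exact code).
from transformers import pipeline

# Build a text-generation pipeline backed by the pretrained model.
generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation of a prompt.
outputs = generator(
    "Transfer learning lets us reuse pretrained language models",
    max_length=40,           # total token budget for prompt plus continuation
    num_return_sequences=1,  # number of alternative continuations to sample
)
print(outputs[0]["generated_text"])
```

Later sections replace this high-level pipeline call with the underlying GPT, BERT, and mBERT architectures to show what is happening inside.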