
9 Transfer Learning with Pretrained Language Models

 

This chapter covers:

  • Using transfer learning to leverage knowledge from unlabeled textual data
  • Using self-supervised learning to pretrain large language models such as BERT
  • Building a sentiment analyzer with BERT and the HuggingFace Transformers library
  • Building a natural language inference model with BERT and AllenNLP

The year 2018 is often called an inflection point in the history of NLP. Sebastian Ruder,[1] a prominent NLP researcher, dubbed this change “NLP’s ImageNet moment,” borrowing the name of a popular computer vision dataset and the powerful models pretrained on it to point out that a similar shift was underway in the NLP community. Powerful pretrained language models such as ELMo, BERT, and GPT-2 achieved state-of-the-art performance on many NLP tasks and, within months, completely changed how we build NLP models.

One important concept underlying these powerful pretrained language models is transfer learning. In this chapter, we’ll first introduce this concept and then move on to BERT, the most popular pretrained language model for NLP. We’ll cover how BERT is designed and pretrained, as well as how to use it for downstream NLP tasks such as sentiment analysis and natural language inference. We’ll also touch on other popular pretrained models, including ELMo and RoBERTa.
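To give you a taste of how little code these pretrained models require in practice, here is a minimal sketch of a sentiment analyzer built on the HuggingFace Transformers pipeline API, along the lines of what we build step by step in section 9.3. The default checkpoint that gets downloaded and the exact scores shown in the comment are illustrative and may vary by library version.

from transformers import pipeline

# Load a sentiment-analysis pipeline backed by a pretrained Transformer.
# With no model name given, the library downloads a default fine-tuned
# checkpoint (which one depends on your Transformers version).
classifier = pipeline("sentiment-analysis")

result = classifier("Transfer learning makes building NLP models much easier!")
print(result)
# Example output (scores will vary): [{'label': 'POSITIVE', 'score': 0.999}]

In section 9.3 we open up this black box and build the tokenizer, model, and training loop ourselves, so you understand what the pipeline is doing under the hood.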

9.1    Transfer learning

9.1.1    Traditional machine learning

9.1.2    Word embeddings

9.1.3    What is transfer learning?

9.2    BERT

9.2.1    Limitations of word embeddings

9.2.2    Self-supervised learning

9.2.3    Pretraining BERT

9.2.4    Adapting BERT

9.3    Case study 1: sentiment analysis with BERT

9.3.1    Tokenizing input

9.3.2    Building the model

9.3.3    Training the model

9.4    Other pretrained language models

9.4.1    ELMo

9.4.2    XLNet

9.4.3    RoBERTa

9.4.4    DistilBERT

9.4.5    ALBERT

9.5    Case study 2: natural language inference with BERT

9.5.1    What is natural language inference?

9.5.2    Using BERT for sentence pair classification

9.5.3    Using Transformers with AllenNLP

9.6    Summary