chapter eight

8 Deep transfer learning for NLP with BERT and multilingual BERT

This chapter covers

Using the pretrained Bidirectional Encoder Representations from Transformers (BERT) architecture to perform question answering, fill-in-the-blank, and next-sentence prediction tasks
Using the BERT architecture for cross-lingual transfer learning

In this chapter and the previous chapter, our goal is to cover some representative deep transfer learning modeling architectures for natural language processing (NLP) that rely on a recently popularized neural architecture, the transformer,¹ for key functions. This is arguably the most important architecture for NLP today. Specifically, our goal has been to look at modeling frameworks such as the generative pretrained transformer (GPT),² Bidirectional Encoder Representations from Transformers (BERT),³ and multilingual BERT (mBERT).⁴ These methods employ neural networks with even more parameters than the deep convolutional and recurrent neural network models that we looked at previously. Despite their larger size, they have exploded in popularity because they scale more effectively on parallel computing architectures. This enables even larger and more sophisticated models to be developed in practice. To make the content more digestible, we split the coverage of these models into two chapters: we covered the transformer and GPT neural network architectures in the previous chapter, and in this chapter, we focus on BERT and mBERT.

8.1 Bidirectional Encoder Representations from Transformers (BERT)

8.1.1 Model architecture

8.1.2 Application to question answering

8.1.3 Application to fill in the blanks and next-sentence prediction tasks

8.2 Cross-lingual learning with multilingual BERT (mBERT)

8.2.1 Brief JW300 dataset overview

8.2.2 Transfer mBERT to monolingual Twi data with the pretrained tokenizer

8.2.3 mBERT and tokenizer trained from scratch on monolingual Twi data

Summary