8 Deep Transfer Learning for NLP with BERT and Multilingual BERT
This chapter covers:
- Using the pre-trained bidirectional encoder representations from transformers (BERT) architecture to perform some interesting tasks
- Using the BERT architecture for cross-lingual transfer learning
In this chapter and the previous one, our goal is to cover some representative deep transfer learning modeling architectures for natural language processing (NLP) that rely on a recently popularized neural architecture, the transformer[63], for key functions. The transformer is arguably the most important architecture for NLP today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT)[64], bidirectional encoder representations from transformers (BERT)[65] and multilingual BERT (mBERT)[66]. These methods employ neural networks with even more parameters than the deep convolutional and recurrent neural network models that we looked at previously. Despite their larger size, they have exploded in popularity because they scale more effectively on parallel computing architectures. This enables even larger and more sophisticated models to be developed in practice. To make the content more digestible, we split the coverage of these models into two chapters: the previous chapter covered the transformer and GPT neural network architectures, and this chapter focuses on BERT and mBERT.