In this chapter and the following chapter, we cover some representative deep transfer learning modeling architectures for NLP that rely on a recently popularized neural architecture—the transformer 1—for key functions. This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we will be looking at modeling frameworks such as GPT,2 Bidirectional Encoder Representations from Transformers (BERT),3 and multilingual BERT (mBERT).4 These methods employ neural networks with even more parameters than the deep convolutional and recurrent neural network models that we looked at in the previous two chapters. Despite the larger size, these frameworks have exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice. To make the content more digestible, we have split the coverage of these models into two chapters/parts: we cover the transformer and GPT neural network architectures in this chapter, whereas in the next chapter, we focus on BERT and mBERT.