This chapter addresses the practicalities of working with the BERT Transformer in your implementations. We will not implement BERT ourselves—that would be a daunting job, and an unnecessary one, since BERT has already been implemented efficiently in various frameworks, including Keras. But we will get close to the inner workings of BERT code. We saw in chapter 9 that BERT has been reported to improve NLP applications significantly. While we do not carry out an extensive comparison in this chapter, you are encouraged to revisit the applications in the previous chapters and replace, for instance, Word2Vec embeddings with BERT embeddings. With the material in chapters 9 and 10, you should be able to do so.
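As a rough illustration of what such a swap involves—this is a minimal sketch, not the code developed in this chapter, and it assumes the Hugging Face transformers library with the pretrained bert-base-uncased weights rather than the Keras implementation discussed later—contextual BERT vectors can be extracted and used wherever a Word2Vec vector was used before:

```python
# Sketch: obtaining BERT embeddings that could stand in for Word2Vec vectors.
# Assumes: pip install transformers tensorflow, and the pretrained
# "bert-base-uncased" checkpoint (both are assumptions, not the book's setup).
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

# Tokenize one example sentence and run it through BERT.
inputs = tokenizer("The market rallied today.", return_tensors="tf")
outputs = model(inputs)

# Per-token contextual embeddings: shape (batch, sequence_length, 768).
token_embeddings = outputs.last_hidden_state
# A single 768-dimensional sentence-level vector (from the [CLS] position).
sentence_embedding = outputs.pooler_output
```

Unlike Word2Vec, these vectors are contextual: the same word receives a different embedding depending on the sentence it appears in, which is one reason the swap tends to help downstream applications.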
The financial costs of pretraining BERT and related models like XLNet from scratch on large amounts of data can be prohibitive (figure 10.1). The original BERT paper (Devlin et al. 2018; see chapter 9) mentions that