
10 Applications of Transformers: hands-on with BERT

 

This chapter covers

  • Creating a BERT layer for importing existing BERT models
  • Training BERT on data
  • Fine-tuning BERT
  • Extracting embeddings from BERT, and inspecting them

This chapter addresses the practicalities of working with the BERT Transformer in your own implementations. We will not implement BERT ourselves: that would be a daunting job, and quite unnecessary, since BERT has been implemented efficiently in a variety of frameworks, including Keras. But we will get close to the inner workings of BERT code. We saw in Chapter 9 that BERT has been reported to improve NLP applications significantly. While we do not carry out an extensive comparison ourselves in this chapter, you are encouraged to return to the applications in the previous chapters and swap, for instance, Word2Vec embeddings for BERT embeddings. With the material of Chapters 9 and 10 combined, you should be able to do so.
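As a first taste of what such a swap involves, the sketch below extracts contextual embeddings from a pretrained BERT model. It is a minimal illustration, not the approach this chapter develops: it assumes the Hugging Face transformers library, the model name bert-base-uncased, and a simple mean-pooling step, all of which are illustrative choices rather than prescriptions from this book.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Load a pretrained tokenizer and BERT encoder (no pretraining from scratch).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised its interest rates.",
             "We had a picnic on the river bank."]

# Tokenize both sentences into input IDs, attention masks, and token type IDs.
inputs = tokenizer(sentences, padding=True, return_tensors="tf")
outputs = bert(inputs)

# Contextual token embeddings: one 768-dimensional vector per (sub)word token.
token_embeddings = outputs.last_hidden_state            # shape: (2, seq_len, 768)

# A crude sentence-level vector, e.g. to stand in for an averaged Word2Vec
# representation: mean-pool the token embeddings (for brevity, padding tokens
# are simply included in the average).
sentence_embeddings = tf.reduce_mean(token_embeddings, axis=1)   # shape: (2, 768)
print(sentence_embeddings.shape)
```

Note that, unlike Word2Vec, the two occurrences of "bank" receive different vectors here, because BERT's embeddings are computed in context.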

10.1 Introduction: working with BERT in practice

Figure 10.1. Practicalities of working with BERT (the mental model for this chapter).

The financial costs of pretraining BERT and related models like XLNet from scratch on large amounts of data can be prohibitive. The original BERT paper ([Devlin2018-10], see Chapter 9) mentions that

  • "[The] training of BERT – Large was performed on 16 Cloud TPUs (64 TPU chips total) [with several pretraining phases]. Each pretraining [phase] took 4 days to complete.”

10.2 A BERT layer

10.3 Training BERT on your own data

10.4 Fine-tuning BERT

10.5 Inspecting BERT

10.5.1 Homonyms in BERT

10.6 Applying BERT

10.7 Summary

10.8 Further reading