This chapter covers:
- How to create a BERT layer for importing existing BERT models.
- How to train BERT on data.
- How to fine-tune BERT.
- How to extract embeddings from BERT, and inspect them.
This chapter addresses the practicalities of working with the BERT Transformer in your own implementations. We will not implement BERT ourselves. That would be a daunting job, and quite unnecessary, since BERT has been implemented efficiently in a variety of frameworks, including Keras. But we will get close to the inner workings of BERT code. We saw in Chapter 9 that BERT has been reported to improve NLP applications significantly. While we do not carry out an extensive comparison ourselves in this chapter, you are encouraged to return to the applications in the previous chapters and, for instance, replace word2vec embeddings with BERT embeddings. With the material of Chapters 9 and 10 combined, you should be able to do so.
The structure of the chapter is depicted in the following figure.
Figure 10.1. Chapter organization.
The financial cost of pretraining BERT and related models like XLNet from scratch on large amounts of data can be prohibitive. The original BERT paper (Devlin2018, see Chapter 9) mentions that
- "[The] training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total) [with several pretraining phases]. Each pretraining [phase] took 4 days to complete."
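To get a feel for the scale of these numbers, the quoted figures can be turned into device-hours per pretraining phase. The following is a minimal back-of-the-envelope sketch. The helper `phase_cost` and the rate `PLACEHOLDER_USD_PER_DEVICE_HOUR` are hypothetical names introduced here for illustration; the rate is a placeholder and not an actual cloud price, so substitute the current rate of your provider.

```python
# Figures quoted from the BERT paper, per pretraining phase.
CLOUD_TPUS = 16       # Cloud TPU devices
TPU_CHIPS = 64        # TPU chips in total
DAYS_PER_PHASE = 4    # duration of one pretraining phase

# Placeholder rate (assumption, NOT a real price): with a value of 1.0,
# the "cost" below simply equals the number of device-hours.
PLACEHOLDER_USD_PER_DEVICE_HOUR = 1.0


def phase_cost(devices, days, usd_per_device_hour):
    """Return the cost of one phase: devices x hours x hourly rate."""
    hours = days * 24
    return devices * hours * usd_per_device_hour


hours_per_phase = DAYS_PER_PHASE * 24          # 4 days = 96 hours
device_hours = CLOUD_TPUS * hours_per_phase    # 16 devices x 96 hours
chip_hours = TPU_CHIPS * hours_per_phase       # 64 chips x 96 hours
cost = phase_cost(CLOUD_TPUS, DAYS_PER_PHASE, PLACEHOLDER_USD_PER_DEVICE_HOUR)

print("Device-hours per phase:", device_hours)
print("Chip-hours per phase:", chip_hours)
print("Cost per phase at the placeholder rate:", cost)
```

Multiplying the device-hours by a realistic hourly rate, and then by the number of pretraining phases, gives a rough lower bound on the compute bill for a single pretraining run.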