
10 Applications of Transformers: hands-on with BERT

 

This chapter covers

  • Creating a BERT layer for importing existing BERT models
  • Training BERT on data
  • Fine-tuning BERT
  • Extracting embeddings from BERT, and inspecting them

This chapter addresses the practicalities of working with the BERT Transformer in your own implementations. We will not implement BERT ourselves: that would be a daunting job, and quite unnecessary, since BERT has been implemented efficiently in a variety of frameworks, including Keras. But we will get close to the inner workings of BERT code. We saw in Chapter 9 that BERT has been reported to improve NLP applications significantly. While we do not carry out an extensive comparison ourselves in this chapter, you are encouraged to return to the applications in the previous chapters and swap, for instance, Word2Vec embeddings for BERT embeddings. With the material of Chapters 9 and 10 combined, you should be able to do so.
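As a first taste of what such a swap involves, the sketch below extracts contextual embeddings from a pretrained BERT model. It is a minimal illustration, not the approach this chapter develops: it assumes the Hugging Face transformers library, the model name bert-base-uncased, and a simple mean-pooling step, all of which are illustrative choices rather than prescriptions from this book.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Load a pretrained tokenizer and BERT encoder (no pretraining from scratch).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised its interest rates.",
             "We had a picnic on the river bank."]

# Tokenize both sentences into input IDs, attention masks, and token type IDs.
inputs = tokenizer(sentences, padding=True, return_tensors="tf")
outputs = bert(inputs)

# Contextual token embeddings: one 768-dimensional vector per (sub)word token.
token_embeddings = outputs.last_hidden_state            # shape: (2, seq_len, 768)

# A crude sentence-level vector, e.g. to stand in for an averaged Word2Vec
# representation: mean-pool the token embeddings (for brevity, padding tokens
# are simply included in the average).
sentence_embeddings = tf.reduce_mean(token_embeddings, axis=1)   # shape: (2, 768)
print(sentence_embeddings.shape)
```

Note that, unlike Word2Vec, the two occurrences of "bank" receive different vectors here, because BERT's embeddings are computed in context.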

10.1 Introduction: working with BERT in practice

Figure 10.1. Practicalities of working with BERT (the mental model for this chapter).

The financial costs of pretraining BERT and related models like XLNet from scratch on large amounts of data can be prohibitive. The original BERT paper ([Devlin2018-10], see Chapter 9) mentions that

  • "[The] training of BERT – Large was performed on 16 Cloud TPUs (64 TPU chips total) [with several pretraining phases]. Each pretraining [phase] took 4 days to complete.”

10.2 A BERT layer

10.3 Training BERT on your own data

10.4 Fine-tuning BERT

10.5 Inspecting BERT

10.5.1 Homonyms in BERT

10.6 Applying BERT

10.7 Summary

10.8 Further reading