This chapter addresses the practicalities of working with the BERT Transformer in your implementations. We will not implement BERT ourselves—that would be a daunting job, and an unnecessary one, since BERT has already been implemented efficiently in various frameworks, including Keras. But we will get close to the inner workings of BERT code. We saw in chapter 9 that BERT has been reported to improve NLP applications significantly. While we do not carry out an extensive comparison in this chapter, you are encouraged to revisit the applications in the previous chapters and replace, for instance, Word2Vec embeddings with BERT embeddings. With the material in chapters 9 and 10, you should be able to do so.
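As a rough illustration of what such a swap involves—this is a minimal sketch, not the code developed in this chapter, and it assumes the Hugging Face transformers library with the pretrained bert-base-uncased weights rather than the Keras implementation discussed later—contextual BERT vectors can be extracted and used wherever a Word2Vec vector was used before:

```python
# Sketch: obtaining BERT embeddings that could stand in for Word2Vec vectors.
# Assumes: pip install transformers tensorflow, and the pretrained
# "bert-base-uncased" checkpoint (both are assumptions, not the book's setup).
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

# Tokenize one example sentence and run it through BERT.
inputs = tokenizer("The market rallied today.", return_tensors="tf")
outputs = model(inputs)

# Per-token contextual embeddings: shape (batch, sequence_length, 768).
token_embeddings = outputs.last_hidden_state
# A single 768-dimensional sentence-level vector (from the [CLS] position).
sentence_embedding = outputs.pooler_output
```

Unlike Word2Vec, these vectors are contextual: the same word receives a different embedding depending on the sentence it appears in, which is one reason the swap tends to help downstream applications.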
The financial costs of pretraining BERT and related models like XLNet from scratch on large amounts of data can be prohibitive (figure 10.1). The original BERT paper (Devlin et al. 2018; see chapter 9) mentions that