9 ULMFiT and Knowledge Distillation Adaptation Strategies
This chapter covers:
- Implementing discriminative fine-tuning and gradual unfreezing, the ULMFiT strategies introduced earlier
- Executing knowledge distillation between teacher and student BERT models
In this chapter and the next, we cover adaptation strategies for the deep NLP transfer learning architectures we have seen so far. In other words, given a pretrained architecture such as ELMo, BERT, or GPT, how can transfer learning be carried out more efficiently? Several measures of efficiency could be used here; we choose to focus on parameter efficiency, where the goal is to produce a model with as few parameters as possible while suffering minimal loss in performance. A smaller model is cheaper to store and easier to deploy, for instance on smartphones. Alternatively, smart adaptation strategies may be needed simply to reach an acceptable level of performance in difficult transfer scenarios.
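To make the first of these strategies concrete, the following is a minimal PyTorch-style sketch of discriminative fine-tuning and gradual unfreezing. A toy stack of linear layers stands in for a pretrained encoder topped by a task head; the layer sizes, the base learning rate, the per-layer decay factor of 2.6, and the three-epoch schedule are illustrative assumptions rather than the settings used later in the chapter.

```python
from torch import nn, optim

# Toy three-"layer-group" network standing in for a pretrained encoder
# plus a task head. The names and sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(128, 64),   # lowest (most general) layer group
    nn.Linear(64, 32),    # middle layer group
    nn.Linear(32, 2),     # task-specific head
)

# Discriminative fine-tuning: each layer group gets its own learning rate,
# with lower layers trained more gently than the task head.
base_lr, decay = 1e-3, 2.6   # dividing the rate by ~2.6 per layer is a common choice
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (decay ** depth)}
    for depth, layer in enumerate(reversed(list(model)))
]
optimizer = optim.Adam(param_groups)

# Gradual unfreezing: start with only the head trainable, then unfreeze one
# additional layer group (top-down) at the start of each subsequent epoch.
for layer in list(model)[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

for epoch in range(3):
    frozen = [l for l in model if not next(l.parameters()).requires_grad]
    if epoch > 0 and frozen:
        # Unfreeze the frozen group closest to the head.
        for p in frozen[-1].parameters():
            p.requires_grad = True
    # ... run one epoch of training here with `optimizer` ...
```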
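Similarly, here is a hedged sketch of the core loss used in knowledge distillation: the student is trained to match the teacher's softened output distribution while still fitting the hard labels. The temperature, the mixing weight alpha, and the random logits in the usage example are illustrative assumptions; in practice the logits would come from teacher and student BERT models.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (student mimics the teacher's softened
    outputs) with the usual hard-label cross-entropy."""
    # Soften both distributions; KL divergence pulls the student toward
    # the teacher. The T^2 factor keeps gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage example with random logits for a batch of 4 examples, 2 classes.
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(distillation_loss(student, teacher, labels))
```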