10 ALBERT, adapters, and multitask adaptation strategies

 

This chapter covers

  • Applying embedding factorization and parameter sharing across layers
  • Fine-tuning a model from the BERT family on multiple tasks
  • Splitting a transfer learning experiment into multiple steps
  • Applying adapters to a model from the BERT family

In the previous chapter, we began exploring adaptation strategies for the deep NLP transfer learning architectures covered so far. In other words, given a pretrained architecture such as ELMo, BERT, or GPT, how can transfer learning be carried out more efficiently? We discussed two critical ideas behind the method ULMFiT: discriminative fine-tuning and gradual unfreezing.
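To make that recap concrete, the following is a minimal PyTorch sketch, not code from the previous chapter, of how those two ideas can be expressed for a BERT-family model: optimizer parameter groups with layer-dependent learning rates for discriminative fine-tuning, and a small helper that progressively unfreezes layers from the top down for gradual unfreezing. The learning rate and decay values are illustrative assumptions.

# Minimal sketch: discriminative fine-tuning and gradual unfreezing
# applied to a BERT-family model (illustrative values only)
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Discriminative fine-tuning: each layer gets its own learning rate,
# decayed as we move from the top of the encoder toward the embeddings.
base_lr, decay = 2e-5, 0.95
layers = [model.embeddings] + list(model.encoder.layer)    # bottom to top
param_groups = [
    {"params": layer.parameters(), "lr": base_lr * decay ** depth}
    for depth, layer in enumerate(reversed(layers))        # top layer first
]
optimizer = torch.optim.AdamW(param_groups)

# Gradual unfreezing: freeze everything, then unfreeze one additional
# layer (from the top down) at the start of each training epoch.
for param in model.parameters():
    param.requires_grad = False

def unfreeze_top(n):
    # Make the top n layers (counting from the output side) trainable
    for layer in list(reversed(layers))[:n]:
        for param in layer.parameters():
            param.requires_grad = True

unfreeze_top(1)  # epoch 1: train only the topmost encoder layer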

The first adaptation strategy we will touch on in this chapter revolves around two ideas aimed at creating transformer-based language models that scale more favorably with a bigger vocabulary and longer input length. The first idea is a clever factorization: splitting a large weight matrix into two smaller matrices, which lets you increase the dimensions of one without affecting the dimensions of the other. The second idea is sharing parameters across all layers. These two strategies are the bedrock of the method known as ALBERT, A Lite BERT.1 We use the implementation in the transformers library to get some hands-on experience with the method.
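To see the effect of these two ideas on model size, here is a minimal sketch, assuming the transformers and torch packages are installed, that builds randomly initialized BERT-base-like and ALBERT-base-like models and compares their parameter counts. The configuration values mirror the published base configurations of each model; this is an illustration, not the chapter's fine-tuning code.

# Minimal sketch: comparing parameter counts of BERT-base-like and
# ALBERT-base-like models built from their configurations
from transformers import AlbertConfig, AlbertModel, BertConfig, BertModel

def count_parameters(model):
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# BERT-base: the token embedding matrix is vocab_size x hidden_size,
# and each of the 12 encoder layers has its own weights.
bert = BertModel(BertConfig(
    vocab_size=30522, hidden_size=768, num_hidden_layers=12,
    num_attention_heads=12, intermediate_size=3072))

# ALBERT-base: embeddings are factorized into vocab_size x embedding_size
# and embedding_size x hidden_size matrices, and a single set of encoder
# weights is shared across all 12 layers.
albert = AlbertModel(AlbertConfig(
    vocab_size=30000, embedding_size=128, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072))

print(f"BERT-base parameters:   ~{count_parameters(bert)/1e6:.0f}M")    # roughly 110M
print(f"ALBERT-base parameters: ~{count_parameters(albert)/1e6:.0f}M")  # roughly 12M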

10.1 Embedding factorization and cross-layer parameter sharing

10.1.1 Fine-tuning pretrained ALBERT on MDSD book reviews

10.2 Multitask fine-tuning

10.2.1 General Language Understanding Evaluation (GLUE)

10.2.2 Fine-tuning on a single GLUE task

10.2.3 Sequential adaptation

10.3 Adapters

Summary