
This is an excerpt from Manning's book Transfer Learning for Natural Language Processing MEAP V04.

One notable weakness of the original formulation of word2vec was disambiguation: there was no way to distinguish between uses of a word whose meaning depends on context, i.e., homographs, e.g., duck (posture) versus duck (bird), or fair (a gathering) versus fair (just). In some sense, the original word2vec formulation represents such a word by the average of the vectors representing each of the homograph's distinct meanings. Embeddings from Language Models[10], abbreviated ELMo after the popular Sesame Street character, is an attempt to develop contextualized embeddings of words using bidirectional LSTMs. The embedding of a word in this model depends very much on its context, with the corresponding numerical representation being different for each such context. ELMo achieves this by being trained to predict the next word in a sequence of words, which is closely related to the concept of language modeling introduced at the beginning of the chapter. Huge datasets, e.g., Wikipedia and various datasets of books, are readily available for training in this framework.
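The following minimal sketch illustrates this context dependence in code. It assumes an older allennlp release (0.9.x or earlier) that still ships ElmoEmbedder, which downloads the pretrained ELMo weights on first use; the example sentences and the layer-averaging step are illustrative choices, not part of the original formulation.

```python
# Minimal sketch: the same word gets different ELMo vectors in different
# contexts. Assumes allennlp <= 0.9.x, which provides ElmoEmbedder.
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # loads the default pretrained ELMo weights

sent_bird = ["the", "duck", "swam", "across", "the", "pond"]
sent_posture = ["please", "duck", "under", "the", "low", "beam"]

# embed_sentence returns an array of shape (3 layers, num_tokens, 1024);
# averaging over the layers gives one vector per token (an illustrative choice).
vecs_bird = elmo.embed_sentence(sent_bird).mean(axis=0)
vecs_posture = elmo.embed_sentence(sent_posture).mean(axis=0)

# "duck" is token index 1 in both sentences, but its vectors differ
# because ELMo conditions each embedding on the surrounding words.
duck_bird, duck_posture = vecs_bird[1], vecs_posture[1]
cos = np.dot(duck_bird, duck_posture) / (
    np.linalg.norm(duck_bird) * np.linalg.norm(duck_posture))
print(f"cosine similarity between the two 'duck' vectors: {cos:.3f}")
```

A static embedding such as word2vec would assign both occurrences of "duck" the same vector; here the two vectors differ because each is computed from its sentence.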

ELMo, which stands for “Embeddings from Language Models,” is arguably the most popular early pretrained language model associated with the ongoing NLP transfer learning revolution. It shares many architectural similarities with SIMOn, also being composed of character-level CNNs followed by bi-LSTMs. This similarity makes a deeper dive into the architecture of ELMo a natural next step after the introduction of SIMOn in this chapter. As with SIMOn, we apply ELMo to an illustrative example problem, namely “fake news” detection, to provide a practical context. A visualization of the ELMo architecture is shown, in the context of tabular column type classification, in Figure 4.2. Some similarities and differences between the two frameworks are immediately evident: both employ character-level CNNs and bi-LSTMs. However, while SIMOn has two context-building stages with RNNs, one for characters in a sentence and another for sentences in a document, ELMo has a single stage, focusing on building context for words in the input document.

Figure 4.2. Visualizing the ELMo architecture in the context of the tabular column type classification example
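To make the shared architectural pattern concrete, here is a simplified Keras sketch of a character-level CNN feeding a bidirectional LSTM. The layer sizes, vocabulary size, and sequence lengths are illustrative assumptions and do not correspond to the actual ELMo or SIMOn hyperparameters.

```python
# Simplified sketch of the shared pattern: a character-level CNN builds a
# vector for each word, and a bi-LSTM builds context across the words.
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_WORDS = 50    # words per input sequence (assumed)
MAX_CHARS = 16    # characters per word (assumed)
CHAR_VOCAB = 100  # size of the character vocabulary (assumed)

# Input: a sequence of words, each word a sequence of character ids.
char_ids = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")

# Character embeddings: (batch, words, chars) -> (batch, words, chars, 16)
char_emb = layers.Embedding(CHAR_VOCAB, 16)(char_ids)

# Character-level CNN applied to every word independently.
char_conv = layers.TimeDistributed(
    layers.Conv1D(filters=64, kernel_size=3, activation="relu"))(char_emb)
word_repr = layers.TimeDistributed(
    layers.GlobalMaxPooling1D())(char_conv)  # one vector per word

# Bidirectional LSTM builds context across the words in the sequence.
contextual = layers.Bidirectional(
    layers.LSTM(128, return_sequences=True))(word_repr)

model = models.Model(inputs=char_ids, outputs=contextual)
model.summary()
```

In this sketch the single bi-LSTM stage corresponds to ELMo's word-level context building; SIMOn would add a second RNN stage on top to build context across sentences in a document.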

Finally, we take a look at the ULMFiT framework, which stands for “Universal Language Model Fine-Tuning.” While this model has not gained as much traction in the community as ELMo, the framework introduces and demonstrates several key techniques and concepts for adapting a pretrained language model to new settings more effectively. These include discriminative fine-tuning and gradual unfreezing. Discriminative fine-tuning stipulates that, since the different layers of a language model contain different types of information, they should be tuned at different rates. Gradual unfreezing describes a procedure for fine-tuning progressively more parameters in a gradual manner, aiming to reduce the risk of overfitting. The ULMFiT framework also includes innovations in varying the learning rate in a unique way during the adaptation process. We introduce the model after ELMo in this chapter, along with several of these concepts.
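The following minimal PyTorch sketch illustrates the two ideas named above on a generic three-layer model. The layer sizes, learning rates, and layer grouping are illustrative assumptions, not the actual fastai/ULMFiT code.

```python
# Sketch of discriminative fine-tuning and gradual unfreezing on a
# stand-in for a pretrained language model (layers chosen for illustration).
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 128)          # lower layer: general features
encoder = nn.LSTM(128, 128, batch_first=True)  # middle layer
head = nn.Linear(128, 2)                       # upper, task-specific layer

# Discriminative fine-tuning: lower layers, which hold more general
# information, are tuned with smaller learning rates than upper layers.
optimizer = torch.optim.Adam([
    {"params": embedding.parameters(), "lr": 1e-4},
    {"params": encoder.parameters(), "lr": 5e-4},
    {"params": head.parameters(), "lr": 1e-3},
])

# Gradual unfreezing: freeze everything, then unfreeze one layer group
# per fine-tuning stage, starting from the top of the network.
layer_groups = [head, encoder, embedding]  # top to bottom
for group in layer_groups:
    for p in group.parameters():
        p.requires_grad = False

for stage, group in enumerate(layer_groups, start=1):
    for p in group.parameters():
        p.requires_grad = True
    # ... run one stage of fine-tuning here using `optimizer` ...
    print(f"stage {stage}: unfroze {type(group).__name__}")
```

Unfreezing from the top downward means the task-specific head adapts first, while the more general lower layers are disturbed only in later stages, which is the intuition behind reducing the risk of overfitting.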
