1 What is transfer learning?


This chapter covers

  • What exactly transfer learning is, both generally in artificial intelligence (AI) and in the context of natural language processing (NLP)
  • Typical NLP tasks and the related chronology of NLP transfer learning advances
  • An overview of transfer learning in computer vision
  • The reason for the recent popularity of NLP transfer learning techniques

Artificial intelligence (AI) has transformed modern society in a dramatic way. Machines now perform tasks that humans used to do, and they do them faster, cheaper, and, in some cases, more effectively. Popular examples include computer vision applications, which teach computers how to understand images and videos, such as for the detection of criminals in closed-circuit television camera feeds. Other computer vision applications include the detection of diseases from images of patients’ organs and the identification of plant species from images of their leaves. Another important branch of AI, natural language processing (NLP), deals particularly with the analysis and processing of human natural language data. Examples of NLP applications include speech-to-text transcription and translation between various languages.

The most recent incarnation of the technical revolution in AI, robotics, and automation—which some refer to as the Fourth Industrial Revolution1—was sparked by the intersection of algorithmic advances for training large neural networks, the availability of vast amounts of data via the internet, and the ready availability of massively parallel computing via graphical processing units (GPUs), which were initially developed for the personal gaming market. The recent rapid advances in automating tasks that rely on human perception—specifically computer vision and NLP—would not have been possible without these strides in neural network theory and practice, which enabled sophisticated representations of input data and desired output signals for these difficult problems.

At the same time, projections of what AI will be able to accomplish have significantly exceeded what has been achieved in practice. We are warned of an apocalyptic future that will erase most human jobs and replace us all, potentially even posing an existential threat to us. NLP is not excluded from this speculation, as it is today one of the most active research areas within AI. It is my hope that reading this book will contribute to helping you gain a better understanding of what is realistically possible to expect from AI, machine learning, and NLP in the near future. However, the main purpose of this book is to arm readers with a set of actionable skills related to a recent paradigm that has become important in NLP—transfer learning.

Transfer learning aims to leverage prior knowledge from different settings—be it a different task, language, or domain—to help solve a problem at hand. It is inspired by the way in which humans learn, because we typically do not learn things from scratch for any given problem but rather build on prior knowledge that may be related. For instance, learning to play a musical instrument is considered easier when one already knows how to play another instrument. Obviously, the more similar the instruments—an organ versus a piano, for example—the more useful prior knowledge is and the easier learning the new instrument will be. However, even if the instruments are vastly different—such as the drum versus the piano—some prior knowledge can still be useful, even if less so, such as the practice of adhering to a rhythm.

Large research laboratories, such as Lawrence Livermore National Laboratory or Sandia National Laboratories, and large internet companies, such as Google and Facebook, are able to learn large, sophisticated models by training deep neural networks on billions of words and millions of images. For instance, Google's NLP model BERT (Bidirectional Encoder Representations from Transformers), which will be introduced in the next chapter, was pretrained on the English version of Wikipedia (2.5 billion words) and the BookCorpus (0.8 billion words).2 Similarly, deep convolutional neural networks (CNNs) have been trained on more than 14 million images of the ImageNet dataset, and the learned parameters have been widely open sourced by a number of organizations. The amounts of resources required to train such models from scratch are not typically available to the average practitioner of neural networks today, such as NLP engineers working at smaller businesses or students at smaller schools. Does this mean that the smaller players are locked out of being able to achieve state-of-the-art results on their problems? Most definitely not—thankfully, the concept of transfer learning promises to alleviate this concern if applied correctly.

Why is transfer learning important?

Transfer learning enables you to adapt or transfer the knowledge acquired from one set of tasks and/or domains to a different set of tasks and/or domains. What this means is that a model trained with massive resources—including data, computing power, time, and cost—can, once open sourced, be fine-tuned and reused in new settings by the wider engineering community at a fraction of the original resource requirements. This represents a big step forward for the democratization of NLP and, more widely, AI. This paradigm is illustrated in figure 1.1, using the act of learning how to play a musical instrument as an example. It can be observed from the figure that information sharing between the different tasks/domains can lead to a reduction in the data required to achieve the same performance for the later, or downstream, task B.

Figure 1.1 An illustration of the advantages of the transfer learning paradigm—shown in the bottom panel—where information is shared between systems trained for different tasks/domains, versus the traditional paradigm—shown in the top panel—where training occurs in parallel between tasks/domains. In the transfer learning paradigm, reduction in data and computing requirements can be achieved via the information/knowledge sharing. For instance, we expect a person to learn to play the drums more easily if they know how to play the piano first.

1.1 Overview of representative NLP tasks

The goal of NLP is to enable computers to understand natural human language. You can think of it as a process of systematically encoding natural language text into numerical representations that accurately portray its meaning. Although various taxonomies of typical NLP tasks exist, the following nonexhaustive list provides a framework for thinking about the scope of the problem and for appropriately framing the various examples addressed in this book; a short code sketch after the list illustrates two of these tasks. Note that some of these tasks may (or may not, depending on the specific algorithm selected) be required by other, more difficult, tasks on the list:

  • Part-of-speech (POS) tagging—Tagging a word in text with its part of speech; potential tags include verb, adjective, and noun.
  • Named entity recognition (NER)—Detecting entities in unstructured text, such as PERSON, ORGANIZATION, and LOCATION. Note that POS tagging could be part of an NER pipeline.
  • Sentence/document classification—Tagging sentences or documents with predefined categories, such as sentiments {“positive,” “negative”}, various topics {“entertainment,” “science,” “history”}, or some other predefined set of categories.
  • Sentiment analysis—Assigning to a sentence or document the sentiment expressed in it, for example, {“positive,” “negative”}. Indeed, you can arguably view this as a special case of sentence/document classification.
  • Automatic summarization—Summarizing the content of a collection of sentences or documents, usually in a few sentences or keywords.
  • Machine translation—Translating sentences/documents from one language into another language or a collection of languages.
  • Question answering—Determining an appropriate answer to a question posed by a human; for example, Question: What is the capital of Ghana? Answer: Accra.
  • Chatterbot/chatbot—Carrying out a conversation with a human convincingly, potentially aiming to accomplish some goal, such as maximizing the length of the conversation or extracting some specific information from the human. Note that a chatbot can be formulated as a question-answering system.
  • Speech recognition—Converting the audio of human speech into its text representation. Although a lot of effort has been and continues to be spent making speech recognition systems more reliable, in this book it is assumed that a text representation of the language of interest is already available.
  • Language modeling—Determining the probability distribution of a sequence of words in human language, where knowing the most likely next word in a sequence is particularly important for language generation—predicting the next word or sentence.
  • Dependency parsing—Splitting a sentence into a dependency tree that represents its grammatical structure and the relationships between its words. Note that POS tagging can be important here.
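To make a couple of these tasks concrete, the following minimal sketch runs off-the-shelf sentiment classification and named entity recognition pipelines using the open source Hugging Face transformers library. Treat it as an illustrative sketch only—the pipeline names and the default pretrained models they download are assumptions made for this example, not requirements for the rest of the chapter.

```python
# Minimal sketch of two NLP tasks from the preceding list, via the
# Hugging Face transformers library (pip install transformers).
from transformers import pipeline

# Sentence classification / sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("This book makes transfer learning approachable."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99}]

# Named entity recognition (NER)
ner = pipeline("ner")
print(ner("Accra is the capital of Ghana."))
# e.g., token-level entities tagged as locations for "Accra" and "Ghana"
```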

1.2 Understanding NLP in the context of AI

Before proceeding with the rest of this book, it is important to understand the term natural language processing and to correctly situate it with respect to other commonly encountered terms, such as artificial intelligence, machine learning, and deep learning. The popular media often assign meanings to these terms that do not match their use by machine learning scientists and engineers. As such, it is important to kick off our journey by defining precisely what we mean when we use these terms, as shown in the Venn diagram in figure 1.2.

Figure 1.2 A Venn diagram visualization of the terms natural language processing (NLP), artificial intelligence (AI), machine learning, and deep learning relative to each other. Symbolic AI is also shown.

As you can see, deep learning is a subset of machine learning, which in turn is a subset of AI. NLP is a subset of AI as well, with a nonempty intersection with deep learning and machine learning. This figure expands on the one presented by François Chollet.3 Please see chapter 6 and section 8.1 of his book for a good overview of the application of neural nets to text. Symbolic AI is also shown in the figure and will be described in the next subsection.

1.2.1 Artificial intelligence (AI)

Artificial intelligence as a field came about in the middle of the 20th century, as a broad effort to make computers mimic and perform tasks typically carried out by human beings. Initial approaches focused on manually deriving and hard-coding explicit rules for manipulating input data for each circumstance of interest. This paradigm is typically referred to as symbolic AI. It worked for well-defined problems such as chess but notably stumbled when encountering problems from the perception category, such as vision and speech recognition. A new paradigm was needed, one where the computer could learn new rules from data, rather than having a human supervisor specify them explicitly. This led to the rise of machine learning.

1.2.2 Machine learning

In the 1990s, the paradigm of machine learning became the dominant trend in AI. Instead of explicitly programming a computer for every possible scenario, the computer would now be trained to associate input to output signals by seeing many examples of such corresponding input-output pairs. Machine learning employs heavy mathematical and statistical machinery, but because it tends to deal with large and complex datasets, the field relies more on experimentation, empirical observations, and engineering than mathematical theory.

A machine learning algorithm learns a representation of input data that transforms it into appropriate output. For that, it needs a collection of data, such as a set of sentence inputs in a sentence classification task, and a set of corresponding outputs, for example, tags such as {“positive,” “negative”} for sentence classification. Also needed is a loss function, which measures how far the current output of the machine learning model is from the expected output of the dataset. To aid in understanding, consider a binary classification task, where the goal of machine learning might be to pick a function called the decision boundary that will cleanly separate data points of the different types, as shown in figure 1.3. This decision boundary should generalize beyond training data to unseen examples. To make this boundary easier to find, you might want to first preprocess or transform the data into a form more amenable for separation. We seek such transformations from the allowable set of functions called the hypothesis set. Automatically determining such a transformation, which makes the machine learning end goal easier to accomplish, is specifically what is referred to as learning.

Figure 1.3 An illustrative example of a major motivating task in machine learning: finding a decision boundary in the hypothesis set to effectively separate different types of points from each other. In the case shown in this figure, the hypothesis set may be the set of arcs.

Machine learning automates this process of searching for the best input-output transformation inside some predefined hypothesis set, using guidance from some feedback signal embodied by the loss function. The nature of the hypothesis set determines the class of algorithms under consideration, as we outline next.

Classical machine learning began with probabilistic modeling approaches such as naive Bayes. Here, we make the naive assumption that the input data features are all independent of one another. Logistic regression is a related method and is typically the first one a data scientist will try on a dataset to establish a baseline. The hypothesis sets for both of these classes of methods are sets of linear functions.
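As a minimal sketch of what a linear hypothesis set looks like in practice—assuming the scikit-learn library, which is used here purely for illustration—the following code fits a logistic regression baseline to a synthetic two-class dataset:

```python
# Hedged sketch: a logistic regression baseline on synthetic 2-D data,
# illustrating a linear hypothesis set and a learned decision boundary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression()    # hypothesis set: linear functions
baseline.fit(X_train, y_train)     # "learning" = searching that set, guided by the loss
print("Held-out accuracy:", baseline.score(X_test, y_test))
```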

Neural networks were initially developed in the 1950s, but it was not until the 1980s that an efficient way to train large networks was discovered—backpropagation coupled with the stochastic gradient descent algorithm. While backpropagation provides a way to compute gradients for the network, stochastic gradient descent uses these gradients to train the network. We review these concepts briefly in appendix B. The first successful practical application occurred in 1989, when Yann LeCun of Bell Labs built a system for recognizing handwritten digits, which was then used heavily by the US Postal Service.

Kernel methods rose in popularity in the 1990s. These methods attempt to solve classification problems by finding good decision boundaries between sets of points, as was conceptualized in figure 1.3. The most popular such method is the support vector machine (SVM). Attempts to find a good decision boundary proceed by mapping the data to a new high-dimensional representation where hyperplanes are valid boundaries. The distance between the hyperplane and the closest data points in each class is then maximized. The high computational cost of operating in the high-dimensional space is alleviated using the kernel trick. Instead of computing high-dimensional data representations explicitly, a kernel function is used to compute distances between points at a fraction of the computing cost. This class of methods is backed by solid theory and is amenable to mathematical analysis, which is linear when the kernel is a linear function—attributes that made these methods extremely popular. However, performance on perceptual machine learning problems left much to be desired, because these methods first required a manual feature engineering step, which was brittle and prone to error.
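The following sketch, under the same scikit-learn assumption, illustrates the kernel trick: two concentric circles are not linearly separable in the original space, but an RBF kernel separates them without ever computing the high-dimensional mapping explicitly.

```python
# Hedged sketch: a support vector machine with an RBF kernel.
# The kernel function replaces an explicit high-dimensional feature map.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit mapping

print("Linear kernel accuracy:", linear_svm.score(X, y))  # poor
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near perfect
```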

Decision trees and related methods are another class of algorithms that is still widely used. A decision tree is a decision support aid that models decisions and their consequences as trees, that is, a graph where any two nodes are connected by exactly one path. Alternatively, a tree can be defined as a flowchart that transforms input values into output categories. The popularity of decision trees rose in the 2010s, when methods relying on them began to be preferred over kernel methods. This popularity benefited from their ease of visualization, comprehension, and explainability. To aid in understanding, figure 1.4 shows an example decision tree structure that classifies the input {A,B} in category 1 if A<10, category 2 if A>=10 while B<25, and category 3 otherwise.

Figure 1.4 Example decision tree structure that classifies the input {A,B} in category 1 if A<10, category 2 if A>=10 while B<25, and category 3 otherwise
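To make the flowchart view concrete, the decision rule of figure 1.4 can be written out directly as a plain Python function:

```python
# The decision tree of figure 1.4, expressed as a plain Python function.
def classify(A, B):
    if A < 10:
        return 1          # category 1
    elif B < 25:          # reached only when A >= 10
        return 2          # category 2
    else:
        return 3          # category 3

print(classify(A=5, B=100))   # 1
print(classify(A=20, B=10))   # 2
print(classify(A=20, B=40))   # 3
```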

Random forests provide a practical machine learning method for applying decision trees. This method involves generating a large number of specialized trees and combining their outputs. Random forests are extremely flexible and widely applicable, making them often the second algorithm to try after logistic regression for baselining. When the Kaggle open competition platform started out in 2010, random forests quickly became the most widely used algorithm on the platform. In 2014, gradient-boosting machines took over. They iteratively learn new decision-tree-based models that address weak points of models from the previous iterations. At the time of this writing, they are widely considered to be the best class of methods for addressing nonperceptual machine learning problems. They are still extremely popular on Kaggle.

Around 2012, GPU-trained deep convolutional neural networks (CNNs) began to win the yearly ImageNet competition, marking the beginning of the current deep learning “golden age.” CNNs started to dominate all major image-processing tasks, such as object recognition and object detection. Similarly, we can find applications in the processing of human natural language, that is, NLP. Neural networks learn via a succession of increasingly meaningful, layered representations of the input data. The number of these layers specifies the depth of the model. This is where the term deep learning—the process of training deep neural networks—comes from. To distinguish them from deep learning, all aforementioned machine learning methods are often referred to as shallow or traditional learning methods. Note that neural networks with a small depth would also be classified as shallow but not traditional. Deep learning has come to dominate the field of machine learning, being a clear favorite for perceptual problems and sparking a revolution in the complexity of problems that can be handled.

Although neural networks were inspired by neurobiology, they are not direct models of how our nervous system works. Every layer of a neural network is parameterized by a set of numbers, referred to as the layer’s weights, specifying exactly how it transforms the input data. In deep neural networks, the total number of parameters can easily reach into the millions. The already-mentioned backpropagation algorithm is the algorithmic engine used to find the right set of parameters, that is, to learn the network. A visualization of a simple neural network with two fully connected hidden layers is shown in figure 1.5. Also shown on the right is a summarized visualization of the same, which we will often employ. A deep neural network would have many such layers. A notable neural network architecture that does not conform to such a feedforward nature is the long short-term memory (LSTM) recurrent neural network (RNN) architecture. Unlike the feedforward architecture in figure 1.5, which accepts a fixed-length input of length 2, LSTMs can process input sequences with arbitrary lengths.

Figure 1.5 Visualization of a simple feedforward neural network with two fully connected hidden layers (left). On the right is a summarized equivalent representation, which we will often employ to simplify diagrams.
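As a rough sketch of the architecture in figure 1.5—assuming the Keras API bundled with TensorFlow, with layer widths and activations chosen arbitrarily for illustration—a feedforward network with a fixed-length input of 2 and two fully connected hidden layers can be defined in a few lines:

```python
# Hedged sketch of the feedforward network of figure 1.5, using tf.keras.
# Layer sizes and activations here are illustrative assumptions only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                      # fixed-length input of 2
    tf.keras.layers.Dense(4, activation="relu"),     # hidden layer 1 (weights)
    tf.keras.layers.Dense(4, activation="relu"),     # hidden layer 2 (weights)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="sgd",                 # stochastic gradient descent
              loss="binary_crossentropy")      # the loss function
model.summary()   # prints layer-by-layer parameter (weight) counts
```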

As previously touched on, what sparked the most recent interest in deep learning was the combination of improved hardware, the availability of vast amounts of data, and algorithmic progress. GPUs had been developed for the video gaming market, and the internet matured to begin providing the field with unprecedented quality and quantity of data. Wikipedia, YouTube, and ImageNet are specific examples of data sources whose availability has driven many advances in computer vision and NLP. The ability of neural networks to eliminate the need for expensive manual feature engineering—which is needed to apply shallow learning methods to perceptual data successfully—is arguably the factor that most eased the adoption of deep learning. Because NLP is a perceptual problem, deep learning will also be the most important class of machine learning algorithms addressed in this book, albeit not the only one.

Next, we aim to get some insight into the history and progression of advances in NLP.

1.2.3 Natural language processing (NLP)

Language is one of the most important aspects of human cognition. It stands to reason that in order to create true artificial intelligence, a machine needs to be taught how to interpret, understand, process, and act on human language. This underlines the importance of NLP to the fields of AI and machine learning.

Just like the other subfields of AI, initial approaches to handling NLP problems, such as sentence classification and sentiment analysis, were based on explicit rules, or symbolic AI. Such systems typically could not generalize to new tasks and would break down easily. Since the advent of kernel methods in the 1990s, human effort has been channeled toward feature engineering—transforming the input data manually into a form that the shallow learning methods could use to produce useful predictions. This method is extremely time-consuming, task-specific, and inaccessible to a nonexpert. The advent of deep learning, around 2012, sparked a true revolution in NLP. The ability of neural networks to automatically engineer appropriate features in some of their layers lowered the bar for the applicability of these methods to new tasks and problems. Human effort then focused on designing the appropriate neural network architecture for any given task, as well as tuning various hyperparameter settings during training.

The standard way to train NLP systems involves collecting a large set of data points, each reliably annotated with output labels, such as “positive” or “negative,” in a sentiment analysis task of sentences or documents. These data points are then supplied to the machine learning algorithm to learn the best representation or transformation of input to output signals that could potentially generalize to new data points. Both within NLP and in other subfields of machine learning, this process is often referred to as the paradigm of supervised learning. The labeling process, which is typically done manually, provides the “supervision signal” for learning the representative transformation. Learning representations from unlabeled data, on the other hand, is referred to as unsupervised learning.

Although today’s machine learning algorithms and systems are not a direct replica of biological learning systems and should not be considered models of such systems, some of their aspects are inspired by evolutionary biology, and, in the past, inspirations drawn from biology have guided significant advances. Based on this, it seems flawed that for each new task, language, or application domain, the supervised learning process has traditionally been repeated from scratch. This process is somewhat antithetical to the way natural systems learn—building on and reusing previously acquired knowledge. Despite this, significant advances have been achieved in learning for perceptual tasks from scratch, notably in machine translation, question-answering systems, and chatbots, although some drawbacks remain. In particular, today’s systems are not robust in handling significant changes in the sample distribution from which the input signals are drawn. In other words, the systems learn how to perform well on inputs of a certain kind or type. If we change the input type, it can lead to a significant degradation in performance and sometimes absolute failure. Moreover, to fully democratize AI and make NLP accessible to the average engineer at a small business—or to anyone without the resources possessed by major internet companies—it would be extremely helpful to be able to download and reuse knowledge acquired elsewhere. This is also important to anyone living in a country where the lingua franca may differ from English or other popular languages for which pretrained models exist, as well as anyone working on tasks that may be unique to their part of the world or tasks that no one has ever explored. Transfer learning provides a way to address some of these issues.

Transfer learning enables one to literally transfer knowledge from one setting—which we define as a combination of a particular task, domain, and language—to a different setting. The original setting is naturally referred to as the source setting, and the final setting is referred to as the target setting. The ease and success of the transfer process hinges on the similarity of the source and target settings. Quite naturally, a target setting that is “similar” to the source in some sense, which we will define later on in this book, leads to an easier and more successful transfer.

Transfer learning has been in implicit use in NLP for much longer than most practitioners realize, because it is a common practice to vectorize words using pretrained embeddings such as word2vec or sent2vec (more on these in the next section). Shallow learning methods have typically been applied to these vectors as features. We cover both of these techniques in more detail in the next section and in chapter 4 and apply them in various ways throughout the book. This popular approach relies on an unsupervised preprocessing step, which is used to first train these embeddings without any labels. Knowledge from this step is then transferred to the specific application in a supervised setting, where the said knowledge is refined and specialized to the problem at hand using a shallow learning algorithm on a smaller set of labeled examples. Traditionally, this paradigm of combining unsupervised and supervised learning steps has been referred to as semisupervised learning.
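A minimal sketch of this semisupervised pipeline follows, assuming the gensim library and its downloadable pretrained word2vec vectors; the model name, the averaging scheme, and the tiny labeled set are illustrative assumptions, and chapter 4 treats these embeddings in detail.

```python
# Hedged sketch of the implicit transfer learning pipeline described above:
# an unsupervised step (pretrained word2vec vectors) followed by a supervised
# shallow classifier on a small labeled set. The model name is an assumption,
# and the download is large.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

word_vectors = api.load("word2vec-google-news-300")   # pretrained, unsupervised

def embed(sentence):
    """Average the pretrained vectors of in-vocabulary words."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    return np.mean([word_vectors[t] for t in tokens], axis=0)

texts = ["great movie loved it", "terrible movie hated it"]
labels = [1, 0]                                        # small labeled set
X = np.stack([embed(t) for t in texts])

clf = LogisticRegression().fit(X, labels)              # supervised refinement
print(clf.predict([embed("what a great film")]))       # e.g., [1]
```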

We next expand on the historical progression of advances in NLP, with a particular focus on the role transfer learning has played recently in this important subfield of AI and machine learning.


1.3 A brief history of NLP advances

To frame your understanding of the state and importance of transfer learning in NLP, it can be helpful to first gain a better sense of the kinds of tasks and techniques that have historically been important for this subfield of AI. This section covers these tasks and techniques and culminates in a brief overview of recent advances in NLP transfer learning. This overview will help you appropriately contextualize the impact of transfer learning in NLP and understand why it is more important now than ever before.

1.3.1 General overview

NLP was born in the middle of the 20th century, alongside AI. A major historical NLP landmark was the Georgetown Experiment of 1954, in which a set of approximately 60 Russian sentences was translated into English. In the 1960s, the Massachusetts Institute of Technology (MIT) NLP system ELIZA convincingly simulated a psychotherapist. Also in the 1960s, the vector space model for information representation was developed, where words came to be represented by vectors of real numbers, which were amenable to computation. The 1970s saw the development of a number of chatterbot/chatbot concepts based on sophisticated sets of handcrafted rules for processing the input information.

In the 1980s and 1990s, we saw the advent of the application of systematic machine learning methodologies to NLP, where rules were discovered by the computer versus being crafted by humans. This advance coincided with the explosion in the wider popularity of machine learning during that time, as we have already discussed earlier in this chapter. The late 1980s witnessed the application of singular value decomposition (SVD) to the vector space model, leading to latent semantic analysis—an unsupervised technique for determining the relationship between words in language.

In the early 2010s, the rise of neural networks and deep learning in the field dramatically transformed NLP. Such techniques were shown to achieve state-of-the-art results for the most difficult NLP tasks, such as machine translation and text classification. The mid-2010s witnessed the development of the word2vec model,4 and its variants sent2vec,5 doc2vec,6 and so on. These neural-network-based techniques vectorize words, sentences, and documents (respectively) in a way that ensures the distance between vectors in the generated vector space is representative of the difference in meaning between the corresponding entities, that is, words, sentences, and documents. Indeed, some interesting properties of such embeddings allowed analogies to be handled—the distance between the words Man and King is approximately equal to the distance between the words Woman and Queen in the induced vector space, for instance. The metric used to train these neural-network-based models was derived from the field of linguistics, more specifically distributional semantics, and did not require labeled data. The meaning of a word was assumed to be tied to its context, that is, the words surrounding it.
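The analogy property can be checked directly against pretrained word2vec vectors; the following minimal sketch assumes the gensim library and its downloadable pretrained model (the model name is an assumption for illustration only).

```python
# Hedged sketch: checking the King/Man vs. Queen/Woman analogy with
# pretrained word2vec vectors via gensim (model name is an assumption).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")
# "king" - "man" + "woman" should land near "queen" in the vector space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g., [('queen', 0.71)]
```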

The variety of methods for embedding various units of text, such as words, sentences, paragraphs, and documents, became a key cornerstone of modern NLP. Once text samples are embedded into an appropriate vector space, analysis can often be reduced to the application of a well-known shallow statistical/machine learning technique for real vector manipulation, including clustering and classification. This can be viewed as a form of implicit transfer learning and as a semisupervised machine learning pipeline—the embedding step is unsupervised, and the learning step is typically supervised. The unsupervised pretraining step essentially reduces the requirements for labeled data and, thereby, the computing resources required to achieve a given performance—something we will learn to do with transfer learning for a broader range of scenarios in this book.

Around 2014, sequence-to-sequence models7 were developed and achieved a significant improvement in difficult tasks such as machine translation and automatic summarization. In particular, whereas pre-neural network NLP pipelines consist of several explicit steps, such as POS tagging, dependency parsing, and language modeling, it was shown that machine translation could be carried out “sequence to sequence.” Here the various layers of a deep neural network automate all of these intermediate steps. These models learn to associate an input sequence, such as a source sentence in one language, with an output sequence—for example, that sentence’s translation into another language—via an encoder that converts inputs into a context vector and a decoder that converts it into the target sequence. Both the encoder and decoder were typically designed to be recurrent neural networks (RNNs). These are able to encode order information in the input sentence, something earlier models, such as the bag-of-words model, couldn’t do, leading to significant improvements in performance.
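A minimal sketch of such an encoder-decoder—assuming the Keras API and arbitrary illustrative vocabulary sizes and dimensions—looks as follows; the encoder LSTM compresses the source sequence into its final states (the context vector), which then initialize the decoder LSTM.

```python
# Hedged sketch of a sequence-to-sequence model with an LSTM encoder and
# decoder. Vocabulary sizes and dimensions are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

src_vocab, tgt_vocab, dim = 5000, 6000, 256

# Encoder: reads the source sequence and summarizes it as a context vector
# (the LSTM's final hidden and cell states).
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence, conditioned on the context vector.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(dim, return_sequences=True,
                                return_state=True)(dec_emb,
                                                   initial_state=[state_h, state_c])
predictions = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], predictions)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```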

It was discovered, however, that long input sequences were harder to deal with, which motivated the development of the technique known as attention. This technique significantly improved the performance of machine translation sequence-to-sequence models by allowing the model to focus on the parts of the input sequence that were most relevant for the output. A model called the transformer8 took this a step further by defining a self-attention layer for both the encoder and decoder, allowing both to build better context for text segments with respect to other text segments in the input sequence. Significant improvements in machine translation were achieved with this architecture, and it was observed to be better suited for training on massively parallel hardware than prior models, speeding up training by up to an order of magnitude.

Up until about 2015, most practical methods for NLP focused on the word level, which means that the whole word was treated as an indivisible atomic entity and assigned a feature vector. This approach has several disadvantages, notably how to treat never-before-seen or out-of-vocabulary words. When the model encountered such words—for instance, if a word was misspelled—the method would fail because it could not vectorize it. In addition, the rise of social media changed the definition of what was considered natural language. Now, billions of people express themselves online using emoticons, newly invented slang, and deliberately misspelled words. It was not long until it was realized that the solution to many of these issues came naturally from treating language at the character level. In this paradigm, every character would be vectorized, and as long as the human was expressing themself with allowable characters, vector features could be generated successfully, and the algorithm could be successfully applied. Zhang et al.9 showed this in the context of character-level CNNs for text classification and demonstrated a remarkable robustness to misspellings.
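A tiny sketch of the character-level idea: given a fixed alphabet, any string—misspelled or not—maps to a sequence of indices that downstream models such as character-level CNNs can consume. The alphabet and padding length below are arbitrary illustrative choices.

```python
# Hedged sketch: character-level encoding of text. Any string over the
# allowed alphabet can be vectorized, misspelled or not.
import string

alphabet = string.ascii_lowercase + string.digits + " .,!?'"
char_to_index = {ch: i + 1 for i, ch in enumerate(alphabet)}  # 0 = unknown/pad

def encode(text, max_len=32):
    indices = [char_to_index.get(ch, 0) for ch in text.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))           # pad to max_len

print(encode("grreat moviee!!"))   # misspellings still encode cleanly
```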

1.3.2 Recent transfer learning advances

Traditionally, learning has proceeded in either a fully supervised or fully unsupervised fashion for any given problem setting—a particular combination of task, domain, and language—from scratch. As previously alluded to, semisupervised learning was recognized as early as 1999, in the context of SVMs, as a way to address potentially limited labeled data availability. An initial unsupervised pretraining step on larger collections of unlabeled data made downstream supervised learning easier. Variants of this were studied to address potentially noisy—possibly incorrect—labels, which is an approach sometimes referred to as weakly supervised learning. However, it was often assumed that the same sampling distribution held for both the labeled and unlabeled datasets.

Transfer learning relaxes these assumptions. In 1995, at the Conference on Neural Information Processing Systems (NeurIPS), one of the biggest conferences on machine learning, transfer learning was popularly recognized as “learning to learn.” Essentially, it was stipulated that intelligent machines need to possess lifelong learning capabilities that reuse learned knowledge for new tasks. This has since been studied under a few different names, including learning to learn, knowledge transfer, inductive bias, and multitask learning. In multitask learning, an algorithm is trained to perform well on multiple tasks simultaneously, thereby uncovering features that may be more generally useful. However, it wasn’t until around 2018 that practical and scalable methods were developed to achieve it in NLP for the hardest perceptual problems.

The year 2018 saw nothing short of a revolution in the field of NLP. The understanding in the field of how to best represent collections of text as vectors evolved dramatically. Moreover, it became widely recognized that open source models could be fine-tuned or transferred to different tasks, languages, and domains. At the same time, several of the big internet companies released even more and bigger NLP models for computing such representations and also specified well-defined procedures for fine-tuning them. All of a sudden, the ability to attain state-of-the-art results in NLP became accessible to the average practitioner, even an independent one. Some called it NLP’s “ImageNet moment,” referencing the explosion in computer vision applications witnessed post-2012, when a GPU-trained neural network won the ImageNet computer vision competition. Just as was the case for the original ImageNet moment, for the first time, a library of pretrained models became available for a large subset of NLP data, together with well-defined techniques for fine-tuning them to particular tasks at hand with labeled datasets of a size significantly smaller than would be needed otherwise. This book’s purpose is to describe, elucidate, evaluate, demonstrably apply, compare, and contrast the various techniques that fall into this category. We briefly overview these techniques next.

Early explorations of transfer learning for NLP focused on analogies to computer vision, where it has been used successfully for over a decade. One such model—Semantic Inference for the Modeling of Ontologies (SIMOn)10—employed character-level convolutional neural networks (CNNs) combined with bidirectional LSTMs for structural semantic text classification. The SIMOn approach demonstrated NLP transfer learning methods directly analogous to those that have been used in computer vision. The rich body of knowledge on transfer learning for computer vision applications motivated this approach. The features learned by this model were shown to be useful for unsupervised learning tasks and to work well on social media language data, which can be somewhat idiosyncratic and very different from the kind of language on Wikipedia and other large book-based datasets.

One notable weakness of the original formulation of word2vec was disambiguation. There was no way to distinguish between various uses of a word that may have different meanings depending on context, such as the case of homographs—duck (posture) versus duck (bird) or fair (a gathering) versus fair (just). In some sense, the original word2vec formulation represents each such word by the average vector of the vectors representing each of these distinct meanings of the homograph. Embeddings from Language Models11—abbreviated ELMo after the popular Sesame Street character—is an attempt to develop contextualized embeddings of words using bidirectional LSTMs. The embedding of a word in this model depends very much on its context, with the corresponding numerical representation being different for each such context. ELMo did this by being trained to predict the next word in a sequence of words, which is very much related to the concept of language modeling that was introduced at the beginning of the chapter. Huge datasets, like Wikipedia and various datasets of books, are readily available for training in this framework.

The Universal Language Model Fine-Tuning12 (ULMFiT) method was proposed to fine-tune any neural-network-based language model for any particular task and was initially demonstrated in the context of text classification. A key concept behind this method is discriminative fine-tuning, where the different layers of the network are trained at different rates. The OpenAI Generative Pretrained Transformer (GPT) modified the encoder-decoder architecture of the transformer to achieve a fine-tunable language model for NLP. It discarded the encoders and retained the decoders and their self-attention sublayers. Bidirectional Encoder Representations from Transformers13 (BERT) did the opposite, modifying the transformer architecture by preserving the encoders and discarding the decoders and also relying on masking of words, which would then need to be predicted accurately as the training metric. These concepts will be discussed in detail in the upcoming chapters.

In all of these language-model-based methods—ELMo, ULMFiT, GPT, and BERT—it was shown that generated embeddings could be fine-tuned for specific downstream NLP tasks with relatively few labeled data points. The focus on language models was deliberate: it was hypothesized that the hypothesis set induced by them would be generally useful, and the data for massive training was known to be readily available.
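As a preview—a minimal sketch assuming the Hugging Face transformers library, with the details deferred to the upcoming chapters—loading a pretrained BERT model with a small, randomly initialized classification head ready for fine-tuning looks like this:

```python
# Hedged sketch: loading a pretrained BERT model and tokenizer with the
# Hugging Face transformers library, ready for fine-tuning on a downstream
# sequence classification task.
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # small task-specific head on top

batch = tokenizer(["I loved this film.", "I hated this film."],
                  padding=True, return_tensors="tf")
outputs = model(batch)        # logits from the (not yet fine-tuned) head
print(outputs.logits.shape)   # (2, 2): two sentences, two classes
```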

Next, we highlight key aspects of transfer learning in computer vision to even better frame transfer learning in NLP and to see if anything can be learned and borrowed for our purposes. This knowledge will become a rich source of analogies that will be used to drive our exploration of NLP transfer learning in the remainder of the book.


1.4 Transfer learning in computer vision

Although the target of this book is NLP, it is helpful to frame NLP transfer learning in the context of computer vision transfer learning. One reason for doing this is that neural network architectures from the two subfields of AI may share some similar features, so techniques from computer vision can be borrowed or, at the very least, used to inform techniques for NLP. Indeed, the availability of such techniques in computer vision is arguably a large driver behind recent NLP transfer learning research. Researchers can access a library of well-defined computer vision methods to experiment with in the relatively unexplored domain of NLP. The extent to which such techniques are directly transferable is, however, an open question, and it is important to remain mindful of a number of important differences. One such difference is that NLP neural networks tend to be shallower than those used in computer vision.

1.4.1 General overview

The goal of computer vision or machine vision is to enable computers to understand digital images and/or videos, including methods for acquiring, processing, and analyzing image data and making decisions based on their derived representation. Video analysis can typically be carried out by splitting videos into frames, which can then be viewed as an image analysis problem. Thus, computer vision can theoretically be posed as an image analysis problem without loss of generality.

Computer vision was born along with AI in the middle of the 20th century. Vision, obviously, is an important part of cognition, so researchers seeking to build intelligent robots recognized it as being important early on. Initial methods in the 1960s attempted to mimic the human visual system, whereas focus on extracting edges and modeling of shapes in scenes rose in popularity in the 1970s. The 1980s witnessed more mathematically robust methods developed for various aspects of computer vision, notably facial recognition and image segmentation, with mathematically rigorous treatments emerging by the 1990s. This move coincided with the rise in popularity of machine learning during that time, as we already touched on. The following couple of decades saw focus and effort spent on developing better feature-extraction methods for images, prior to the application of a shallow machine learning technique. The “ImageNet moment” of 2012, when GPU-accelerated neural networks won the prominent ImageNet competition by a wide margin for the very first time, marked a revolution in the field.

ImageNet14 was originally published in 2009 and rapidly became the basis of a competition for testing the best methods for object recognition. The famed 2012 neural network entry pointed to deep learning as the way forward for computer vision in particular and perceptual problems in machine learning in general. Importantly for us, a number of researchers quickly realized that neural network weights from pretrained ImageNet models could be used to initialize neural network models for other, sometimes seemingly unrelated, tasks and achieve a significant improvement in performance.

1.4.2 Pretrained ImageNet models

The various teams that have won the hallmark ImageNet yearly competition have been very generous with sharing their pretrained models. Notable examples of such CNN models follow.

The VGG architecture was initially introduced in 2014, with variants VGG16 (16 layers deep) and VGG19 (19 layers deep). To make the deeper network converge during training, the shallower network needed to be trained to convergence first, and its parameters were then used to initialize the deeper network. This architecture has been found to be somewhat slow to train and relatively large in terms of overall number of parameters—roughly 130 million to 150 million.

Some of these issues were addressed by the ResNet architecture in 2015. Despite being substantially deeper, ResNet models have significantly fewer parameters—the smallest variant, ResNet50, is 50 layers deep with approximately 25 million parameters. A key to achieving this reduction was regularization via a technique called max pooling and a modular design built out of sub-building blocks.

Other notable examples include Inception and its extension Xception, proposed in 2015 and 2016, respectively, which aim to create multiple levels of extracted features by stacking multiple convolutions within the same network module. Both of these models achieved further significant reduction in model size.

1.4.3 Fine-tuning pretrained ImageNet models

Given the existence of the pretrained CNN ImageNet models just presented, it is uncommon for practitioners to train computer vision models from scratch. By far the more common approach is to download one of these open source models and either use it to initialize a similar architecture—fine-tuning a subset of the layers on limited labeled data—or use it as a fixed feature extractor.

A visualization of how a subset of layers to be fine-tuned is typically selected in a feedforward neural network is shown in figure 1.6. A threshold is moved away from the output (and toward the input) as more data becomes available in the target domain, with layers between the threshold and output retrained. This change occurs because the increased amount of data can be used to train more parameters effectively than could be done otherwise. Additionally, movement of the threshold must happen in the right-to-left direction, that is, away from the output and toward the input. This movement direction allows us to retain layers encoding general features that are close to the input, while retraining layers closer to the output, which encode features specific to the source domain. Moreover, when the source and target are highly dissimilar, some of the more specific parameters/layers to the right of the threshold can be discarded.

Feature extraction, on the other hand, involves removing only the last layer of the network, which, instead of producing data labels, will now produce a set of numerical vectors on which a shallow machine learning method, such as the support vector machine (SVM), can be trained as before.

In the retraining or fine-tuning approach, the prior pretrained weights do not all stay fixed, but a subset of them can be allowed to change based on the new labeled data. However, it is important to make sure that the number of parameters being trained does not lead to overfitting on limited new data, which motivates us to freeze some to reduce the number of parameters being trained. Picking the number of layers to freeze has typically been done empirically, with the heuristics in figure 1.6 guiding it.

Figure 1.6 Visualization of the various transfer learning heuristics applicable in computer vision for feedforward neural network architectures, which we will draw on in NLP whenever possible. A threshold is moved to the left, with more availability of training data in the target domain, and all parameters to the right of it are retrained, with the exception of those that are discarded due to increasing dissimilarity between source and target domains.

It has been established in CNNs that the early layers—those closer to the input layer—perform functions more general to the task of image processing, such as detecting any edges in the image. Later layers—those closer to the output layer—perform functions more specific to the task at hand, such as mapping final numerical outputs to specific labels. This arrangement leads us to unfreeze and fine-tune layers closer to the output layer first and then incrementally unfreeze and fine-tune layers closer to the input layer if performance is found to be unsatisfactory. This process can continue as long as the available labeled dataset for the target task can support the increase in training parameters.

A corollary of this process is that if the labeled dataset for the target task is very large, the whole network should probably be fine-tuned. If the target dataset is small, on the other hand, one needs to think carefully about how similar the target dataset is to the source dataset. If it is very similar, the model architecture can be directly initialized to pretrained weights when fine-tuning. If very different, it may be beneficial to discard the pretrained weights in some of the later layers of the network when initializing, because they may not have any relevance to the target task. Moreover, because the dataset is not large, only a small set of the remaining later layers should be unfrozen while fine-tuning.
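A rough tf.keras sketch of the two reuse strategies discussed above follows; the three-class head and the choice of how many layers to leave unfrozen are arbitrary illustrative assumptions.

```python
# Hedged sketch of the two reuse strategies described above, using a
# pretrained ResNet50 from tf.keras.applications. The new head and the
# number of unfrozen layers are illustrative assumptions only.
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                       pooling="avg")   # pretrained on ImageNet

# Strategy 1: fixed feature extractor—freeze every pretrained layer.
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax")  # new target task: 3 classes
])

# Strategy 2: fine-tuning—unfreeze only the layers closest to the output,
# per the threshold heuristic of figure 1.6, as more target data is available.
base.trainable = True
for layer in base.layers[:-10]:   # keep general, early layers frozen
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small learning rate
              loss="sparse_categorical_crossentropy")
```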

We will conduct computational experiments to explore these heuristics further in the subsequent chapters.


1.5 Why is NLP transfer learning an exciting topic to study now?

Now that we have framed the current state of NLP in the context of the general artificial intelligence and machine learning landscapes, we are in a good position to summarize why the key theme of this book is important and why you, the reader, should care very much about it.

By now it should be clear that recent years have seen a rapid acceleration in advances in this field. A number of pretrained language models have been made available for the very first time, along with well-defined procedures for fine-tuning them to more specific tasks or domains. It was discovered that analogies could be made to the way transfer learning had been conducted in computer vision for a decade, and a number of research groups were able to rapidly draw on a body of existing computer vision techniques to push forward the state of our understanding of NLP transfer learning. This work has had the important benefit of reducing computing and training-time requirements for these problems for the average practitioner without access to massive resources.

A lot of excitement exists in the field right now, and droves of researchers are working on this problem area. The many outstanding questions in a subject this novel present an opportunity for machine learning researchers to make a name for themselves by helping move the state of knowledge forward. Simultaneously, social media, which has become an increasingly significant factor in human interaction, presents new challenges not seen in NLP before. These challenges include slang/jargon and emoticon use, which may not be found in the more formal language typically used to train language models. A demonstrative example is the set of severe vulnerabilities discovered in the social media natural language ecosystem—notably the election-interference claims raised by sovereign democracies against foreign governments, as in the Cambridge Analytica scandal.15 In addition, the general sense that the “fake news” problem is worsening has increased interest in the field and has driven discussions of the ethical considerations that should be made when building these systems. All this, coupled with the proliferation of increasingly sophisticated chatbots in a variety of domains and the associated cybersecurity threats, implies that the problem of transfer learning in NLP is poised to continue growing in significance.

Summary

  • Artificial intelligence (AI) holds the promise of fundamentally transforming our society. To democratize the benefits of this transformation, we must ensure that state-of-the-art advances are accessible to everyone, regardless of language, access to massive computing resources, and country of origin.
  • Machine learning is the dominant modern paradigm in AI, which, rather than explicitly programming a computer for every possible scenario, trains it to associate input to output signals by seeing many examples of such corresponding input-output pairs.
  • Natural language processing (NLP), the subfield of AI we will be discussing in this book, deals with the analysis and processing of human natural language data and is one of the most active areas of AI research today.
  • A recently popularized paradigm in NLP, transfer learning, enables you to adapt or transfer the knowledge acquired from one set of tasks or domains to a different set of tasks or domains. This is a big step forward for the democratization of NLP and, more widely, AI, allowing knowledge to be reused in new settings at a fraction of the previously required resources, which may not be available to everyone.
  • Key modeling frameworks enabling transfer learning in NLP include ELMo and BERT.
  • The recent rise in the importance of social media has changed the definition of what is considered natural language. Now, billions of people express themselves online using emoticons, newly invented slang, and deliberately misspelled words. All these present new challenges, which we must take into account when developing new transfer learning techniques for NLP.
  • Transfer learning is relatively well understood in computer vision, and whenever possible, we should draw on this body of knowledge when experimenting with new transfer techniques for NLP.

1. K. Schwab, The Fourth Industrial Revolution (Geneva: World Economic Forum, 2016).

2. J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” arXiv (2018).

3. F. Chollet, Deep Learning with Python (New York: Manning Publications, 2018).

4. T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv (2013).

5. M. Pagliardini et al., “Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features,” Proc. of NAACL-HLT (2018).

6. Q. V. Le et al., “Distributed Representations of Sentences and Documents,” arXiv (2014).

7. I. Sutskever et al., “Sequence to Sequence Learning with Neural Networks,” NeurIPS Proceedings (2014).

8. A. Vaswani et al., “Attention Is All You Need,” NeurIPS Proceedings (2017).

9. X. Zhang et al., “Character-Level Convolutional Networks for Text Classification,” NeurIPS Proceedings (2015).

10. P. Azunre et al., “Semantic Classification of Tabular Datasets via Character-Level Convolutional Neural Networks,” arXiv (2019).

11. M. E. Peters et al., “Deep Contextualized Word Representations,” Proc. of NAACL-HLT (2018).

12. J. Howard et al., “Universal Language Model Fine-Tuning for Text Classification,” Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (2018).

13. J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” Proc. of NAACL-HLT (2019).

14. J. Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009).

15. K. Shaffer, Data versus Democracy: How Big Data Algorithms Shape Opinions and Alter the Course of History (New York: Apress, 2019).
