2 Getting started with baselines: Data preprocessing

 

This chapter covers

  • Introducing a pair of natural language processing (NLP) problems
  • Obtaining and preprocessing NLP data for such problems
  • Establishing baselines for these problems using key generalized linear methods

In this chapter, we dive directly into solving NLP problems. This is a two-part exercise, spanning this chapter and the next. Our goal is to establish a set of baselines for a pair of concrete NLP problems, which we will later use to measure the progressive improvements gained from increasingly sophisticated transfer learning approaches. Along the way, we aim to sharpen your general NLP instincts and refresh your understanding of the typical steps involved in setting up problem-solving pipelines for such problems. You will review techniques ranging from tokenization to data structure and model selection. We first train some traditional machine learning models from scratch to establish preliminary baselines. We complete the exercise in chapter 3, where we apply the simplest form of transfer learning to a pair of recently popularized deep pretrained language models, which involves fine-tuning only a handful of the final layers of each network on a target dataset. This activity serves as an applied, hands-on introduction to the main theme of the book: transfer learning for NLP.

2.1 Preprocessing email spam classification example data

 
 
 

2.1.1 Loading and visualizing the Enron corpus

 
 

2.1.2 Loading and visualizing the fraudulent email corpus

 
 
 

2.1.3 Converting the email text into numbers

 
 

2.2 Preprocessing movie sentiment classification example data

 
 

2.3 Generalized linear models

 
 
 
 

2.3.1 Logistic regression

 
 

2.3.2 Support vector machines (SVMs)

 
 

Summary

 