chapter two

2 Your first NLP example

This chapter covers

Implementing your first practical NLP application from scratch
Structuring an NLP project from beginning to end
Exploring NLP concepts, including tokenization and text normalization
Applying a machine learning algorithm to textual data

In this chapter, you will learn how to implement your own NLP application from scratch. In doing so, you will also learn how to structure a typical NLP pipeline and how to apply a simple machine learning algorithm to solve your task. The particular application you will implement is spam filtering, which chapter 1 introduced as one of the classic tasks at the intersection of NLP and machine learning.
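To make the idea of a pipeline concrete before going into detail, here is a minimal sketch of what a spam filter built from scratch might look like. It is not the implementation developed in this chapter: the toy messages, the function names (tokenize, train, classify), and the choice of a multinomial Naive Bayes classifier with add-one smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: (text, label) pairs invented for this sketch.
TRAIN = [
    ("Win a FREE prize now", "spam"),
    ("Free cash, click now", "spam"),
    ("Lunch at noon tomorrow?", "ham"),
    ("Please review the meeting notes", "ham"),
]


def tokenize(text):
    """Lowercase the text (normalization) and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(examples):
    """Count words per class; these counts are the whole 'model'."""
    word_counts = {}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the class with the highest Naive Bayes log-score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common the class is in the training data.
        score = math.log(class_counts[label] / total)
        # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("Claim your free prize", model))
print(classify("Notes from the meeting", model))
```

Each function corresponds to one stage of the pipeline: preparing the data, splitting and normalizing the text, training, and classifying. With only four training messages the sketch is far too small to be useful in practice, and it skips evaluation entirely; it is meant only to show how the stages fit together.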

2.1 Introducing NLP in practice: Spam filtering

In this book, spam filtering is your first practical NLP application because it exemplifies a widespread family of tasks: text classification. Text classification covers several applications discussed in this book, including user profiling (chapters 5 and 6), sentiment analysis (chapters 7 and 8), and topic classification (chapter 9), so this chapter will give you a good start. First, let’s see what exactly classification addresses.

We classify things in our everyday lives all the time: classifying simply means putting things into clearly defined groups, classes, or categories. Here are some examples:

2.2 Understanding the task
