Chapter 14. Training a classifier

This chapter covers

  • Extracting features from text
  • Converting features for Mahout’s use
  • Training two Mahout classifiers
  • Selecting from among Mahout’s learning algorithms

This chapter explores the first stage in classification: training the model. Developing a classifier is a dynamic process that requires you to think creatively about the best way to describe the features of your data and to consider how they will be used by the learning algorithm you choose to train your models. Some kinds of data lend themselves readily to classification; others offer a greater challenge, which can be rewarding, frustrating, and interesting all at once.

In this chapter, you’ll learn how to choose and extract features effectively to build a Mahout classifier. Feature extraction involves much more than the simplified steps you saw in the examples in chapter 13. Here we go into the details of feature extraction, including how to preprocess raw data into classifiable data and how to convert classifiable data into vectors that can be used by the Mahout classification algorithms. We use a computational marketing problem as an example to show how training data might be extracted from a database.
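The two conversion steps just described, raw data to classifiable data and classifiable data to vectors, can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions, not Mahout's own API: the class name `FeatureVectorSketch`, the methods `tokenize` and `encode`, and the vector size of 16 are hypothetical, and the hashing scheme (word hash modulo vector length) is a simplified stand-in for the encoders a real Mahout pipeline would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names come from Mahout.
class FeatureVectorSketch {

    // Step 1: preprocess raw text into classifiable data (lowercase word tokens).
    public static List<String> tokenize(String rawText) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that is not a lowercase letter or digit.
        for (String token : rawText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 2: convert tokens into a fixed-length numeric vector.
    // Each token is hashed to one slot, and that slot counts its occurrences.
    // Different tokens may collide in the same slot; a larger vector reduces this.
    public static double[] encode(List<String> tokens, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String token : tokens) {
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Hello, World! Hello?");
        double[] vector = encode(tokens, 16); // 16 slots is an arbitrary choice
        System.out.println(tokens);
        System.out.println(Arrays.toString(vector));
    }
}
```

The vector length is fixed in advance, so new words never force the vector to grow; the cost is that unrelated words can occasionally share a slot.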

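Once you understand how to get data ready for classification, you’ll build a classifier in section 14.4 using a standard data set, 20 newsgroups, with Mahout’s stochastic gradient descent (SGD) algorithm. Section 14.5 then discusses how to choose an algorithm to train the classifier, and section 14.6 classifies the same 20 newsgroups data with naive Bayes.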

14.1. Extracting features to build a Mahout classifier

14.2. Preprocessing raw data into classifiable data

14.3. Converting classifiable data into vectors

14.4. Classifying the 20 newsgroups data set with SGD

14.5. Choosing an algorithm to train the classifier

14.6. Classifying the 20 newsgroups data with naive Bayes

14.7. Summary