4 The outlier detection process

This chapter covers

  • Working on an outlier detection project in production
  • The types of problems we may work with
  • Where outlier detectors are actually the best option
  • Collecting and preparing data as well as fitting the models
  • Evaluating and combining models

We now have a good sense of how outlier detection works in general and of how some specific algorithms for identifying anomalies work, including statistical and machine learning-based methods. Executing an outlier detection project effectively, though, involves a number of steps, and now that we have a good foundation, we should look at them.

In this chapter, we’ll go through the main steps typically involved in outlier detection projects, though these will, of course, vary from project to project. If you’re familiar with other areas of machine learning, such as prediction, the steps for outlier detection will look very similar. Each step is important, and each has some subtle points that often differ a little from the corresponding step in a prediction project.

4.1 Outlier detection workflow

The general steps for a fairly typical outlier detection project will, more or less, be

4.2 Determining the types of outliers we are interested in

4.2.1 Statistical outliers

4.2.2 Specific outliers

4.2.3 Known and unknown outliers

4.3 Choosing the type of model to be used

4.3.1 Selecting the category of outlier detector

4.3.2 Rules-based approaches

4.3.3 Classifier-based approaches

4.4 Collecting the data

4.5 Examining the data

4.6 Cleaning the data

4.7 Feature selection

4.8 Feature engineering

4.9 Encoding categorical values

4.10 Scaling numeric values

4.11 Fitting a set of models and generating predictions

4.12 Evaluating the models

4.13 Setting up ongoing outlier detection systems