chapter three

3 Machine learning-based outlier detection

This chapter covers

An introduction to unsupervised machine learning-based outlier detection
The curse of dimensionality
Some of the broad categories of outlier detection algorithms used
Descriptions and examples of some specific algorithms
The properties of outlier detectors

If you are working on a challenging data problem, such as examining tables of financial data for fraud, sensor readings that may indicate a need for maintenance, or astronomical observations that may include rare or unknown phenomena, the statistical techniques we’ve looked at so far may be useful but not sufficient to find everything you’re interested in.

We now have a good introduction to outlier detection and can begin to look at machine learning approaches, which allow detection of a much wider range of outliers than is possible with statistical methods. The main factor distinguishing machine learning methods is that most of them are multivariate tests: they consider all features together and attempt to find unusual records, as opposed to unusual single values. This makes more subtle outliers, such as fraud, machine failure, or novel telescope readings, much more feasible to detect.
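The difference between an unusual single value and an unusual record can be illustrated with a small sketch. The synthetic data, the function names, and the example records below are all assumptions made for illustration, and Mahalanobis distance is used only as a simple stand-in for a multivariate test; it is not necessarily one of the detectors covered in this chapter. The sketch builds two strongly correlated features and then scores a record whose individual values are ordinary but whose combination is not.

```python
import numpy as np

def univariate_z_scores(data, record):
    """Absolute z-score of each value in `record`, one feature at a time."""
    return np.abs((record - data.mean(axis=0)) / data.std(axis=0))

def mahalanobis_distance(data, record):
    """Distance of `record` from the center of `data`, taking the
    correlation between features into account."""
    diff = record - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    return float(np.sqrt(diff @ inv_cov @ diff))

# Synthetic data (an assumption for illustration): y closely tracks x.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.1, size=1000)
data = np.column_stack([x, y])

typical = np.array([1.5, 1.5])    # follows the x ~ y relationship
unusual = np.array([1.5, -1.5])   # each value is ordinary; the pair is not

for name, record in [("typical", typical), ("unusual", unusual)]:
    z = univariate_z_scores(data, record)
    d = mahalanobis_distance(data, record)
    print(f"{name}: per-feature z-scores = {np.round(z, 2)}, "
          f"multivariate distance = {d:.2f}")
```

Checked one feature at a time, both records look unremarkable, because neither value is far from its column's mean. Only the test that considers both features together separates the second record from the first, which is the kind of outlier that univariate checks miss.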

3.1 The curse of dimensionality

3.1.1 Data sparsity

3.1.2 Data appearing in the margins

3.1.3 Distance calculations

3.2 Types of algorithms

3.2.1 Distance based

3.2.2 Density based

3.2.3 Cluster based

3.2.4 Frequent item set based

3.2.5 Model based

3.3 Types of detectors

3.3.1 Clean vs. contaminated training data

3.3.2 Numeric vs. categorical

3.3.3 Local vs. global detectors

3.3.4 Scores vs. flags

3.3.5 The time required for training and predicting

3.3.6 The ability to process many features

3.3.7 The parameters required

Summary