This chapter covers
- Statistical methods to find outliers in single columns
- More flexible methods based on histograms, kernel density estimation, and nearest neighbors measurements
- Methods to combine scores from multiple statistical tests
- An introduction to multidimensional outliers
In this chapter, we begin to look at specific methods to identify outliers. We start with statistical methods, defined here simply as methods that predate machine learning and that are based on statistical descriptions of data distributions, such as standard deviations and interquartile ranges. They are designed specifically to find extreme values: the unusually small and large values in sequences of numeric values. These are the easiest outlier tests to understand and provide a good background for the machine learning-based approaches we will focus on later. Statistical methods do have some significant limitations. They work on single columns of data and often don't extend well to tables. They also often assume specific data distributions, typically that the data is Gaussian, or at least nearly so. At the same time, although these methods are simpler to understand than the methods we look at later, they still introduce many of the complications inherent in outlier detection, as the short example below suggests.
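To make the idea concrete before we cover these tests properly, the following is a minimal sketch of one common statistical test, the interquartile range (IQR) test, applied to a single numeric column. The function name `iqr_outliers`, the sample data, and the coefficient of 1.5 (Tukey's conventional choice) are illustrative assumptions, not definitions taken from this chapter.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    # k=1.5 is the conventional Tukey coefficient; larger values
    # flag only more extreme points.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical single column of numeric values with one extreme value
data = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 12.7, 5.2])
print(iqr_outliers(data))  # only 12.7 is flagged
```

Even this small sketch hints at the complications to come: the result depends on the choice of coefficient, the test looks at one column in isolation, and it says nothing about values that are unusual only in combination with other columns.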