This chapter covers
- Statistical methods to find outliers in single columns
- More flexible methods based on histograms, kernel density estimation, and nearest neighbors measurements
- Methods to combine scores from multiple statistical tests
- An introduction to multidimensional outliers
In this chapter, we begin to look at specific methods to identify outliers. We start with statistical methods, defined here simply as methods that predate machine learning and that are based on statistical descriptions of data distributions, such as standard deviations and interquartile ranges. They are designed specifically to find extreme values: the unusually small and large values in sequences of numeric values. These are the easiest outlier tests to understand and provide a good background for the machine learning-based approaches we will focus on later. Statistical methods do have some significant limitations. They work on single columns of data and often don't extend well to tables. They also often assume specific data distributions, typically that the data is Gaussian, or at least nearly so. At the same time, although these methods are simpler to understand than the methods we look at later, they still introduce many of the complications inherent in outlier detection, as the short example below suggests.
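To make the idea concrete before we cover these tests properly, the following is a minimal sketch of one common statistical test, the interquartile range (IQR) test, applied to a single numeric column. The function name `iqr_outliers`, the sample data, and the coefficient of 1.5 (Tukey's conventional choice) are illustrative assumptions, not definitions taken from this chapter.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    # k=1.5 is the conventional Tukey coefficient; larger values
    # flag only more extreme points.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical single column of numeric values with one extreme value
data = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 12.7, 5.2])
print(iqr_outliers(data))  # only 12.7 is flagged
```

Even this small sketch hints at the complications to come: the result depends on the choice of coefficient, the test looks at one column in isolation, and it says nothing about values that are unusual only in combination with other columns.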