This chapter covers
- Testing for unusual duplicates or gaps in the data
- Identifying anomalous entities
- Identifying anomalous time periods
- Finding items that are unusually common, as opposed to unusually rare
- Detecting anomalous trends or distributions of values

Often when analyzing data, we’re interested in finding not just unusual individual records but any unusual patterns in the data. For this, an important step in outlier detection is searching for what are called collective outliers. These are cases in which individual rows are not necessarily unusual, but sets of rows are. For example, in network logs a single failed password attempt is likely not unusual, but a large number of them in a short period would be. With credit card records, a large purchase may not be unusual for the cardholder, but many large purchases in a short period may be very unusual. With collective outlier tests, we identify sets of records that are collectively unusual. In these examples, the set of records related to the failed passwords and the set of records related to the large credit card purchases would, when considered together, form outliers.
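
The failed-password case above, as an added example that is not part of the original text: it counts failures per source in a sliding window and flags sources whose peak count is far above the typical source's. The log rows, the five-minute window, the cutoff of 5 and the function names are assumptions made for the illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from statistics import median

# Constructed sample events: (ISO timestamp, source address, login outcome).
EVENTS = [
    # Isolated failures spread over the day: ordinary rows, ordinary sets.
    ("2024-05-01T08:15:00", "10.0.0.5", "fail"),
    ("2024-05-01T08:16:00", "10.0.0.5", "ok"),
    ("2024-05-01T11:15:00", "10.0.0.5", "fail"),
    ("2024-05-01T15:15:00", "10.0.0.5", "fail"),
    ("2024-05-01T19:15:00", "10.0.0.5", "fail"),
    ("2024-05-01T09:05:00", "10.0.0.8", "fail"),
    ("2024-05-01T09:40:00", "10.0.0.7", "fail"),
    ("2024-05-01T09:42:00", "10.0.0.7", "fail"),
] + [
    # Twelve failures in under two minutes: each row looks like those above.
    (f"2024-05-01T13:0{i // 6}:{i % 6 * 10:02d}", "10.0.0.9", "fail")
    for i in range(12)
]

def peak_failures(events, window=timedelta(minutes=5)):
    """Per source, the largest number of failures inside any one window."""
    recent, peak = defaultdict(deque), defaultdict(int)
    for ts, src, outcome in sorted(events):  # ISO strings sort by time
        if outcome != "fail":
            continue
        t = datetime.fromisoformat(ts)
        q = recent[src]
        q.append(t)
        while t - q[0] > window:  # drop failures older than the window
            q.popleft()
        peak[src] = max(peak[src], len(q))
    return dict(peak)

def collective_outliers(peaks, cutoff=5.0):
    """Sources whose peak is far above the median peak, scaled by the MAD."""
    values = list(peaks.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = max(mad, 1.0)  # floor: MAD is near 0 when most sources look alike
    scores = {src: (p - med) / scale for src, p in peaks.items()}
    return {src: s for src, s in scores.items() if s > cutoff}

if __name__ == "__main__":
    peaks = peak_failures(EVENTS)
    print(peaks)
    print(collective_outliers(peaks))
```

No single row is flagged here; 10.0.0.9 stands out only because of how many of its rows fall inside one window.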