chapter twelve

12 Collective outliers

This chapter covers

Testing for unusual duplicates or gaps in the data
Identifying anomalous entities
Identifying anomalous time periods
Finding items that are unusually common, as opposed to unusually rare
Detecting anomalous trends or distributions of values

Often when analyzing data, we’re interested in finding not just unusual individual records but also unusual patterns in the data. An important step in outlier detection is therefore searching for what are called collective outliers: cases in which individual rows are not necessarily unusual, but sets of rows are. For example, in network logs a single failed password attempt is likely not unusual, but a large number of them in a short period would be. With credit card records, a large purchase may not be unusual for the cardholder, but many large purchases in a short period may be very unusual. Collective outlier tests identify sets of records that are unusual when considered together. In these examples, the set of failed password attempts and the set of large credit card purchases would each form a collective outlier.
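To make the failed-password example concrete, here is a minimal sketch of a collective outlier test that uses only the Python standard library. The function name `find_bursts`, the five-minute window, the threshold of four attempts, and the sample records are all assumptions made for illustration; none of them come from this chapter. The idea is that no single failed attempt is flagged, only a user whose attempts cluster tightly in time.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bursts(failed_logins, window=timedelta(minutes=5), threshold=4):
    """Return {user: max failures seen within any `window`-long span}
    for users whose maximum reaches `threshold`.

    `failed_logins` is an iterable of (user, timestamp) pairs.
    The window and threshold defaults are illustrative, not tuned values.
    """
    # Group the timestamps by user (the entity we aggregate over).
    by_user = defaultdict(list)
    for user, ts in failed_logins:
        by_user[user].append(ts)

    flagged = {}
    for user, times in by_user.items():
        recent = deque()  # timestamps inside the current sliding window
        max_count = 0
        for ts in sorted(times):
            recent.append(ts)
            # Drop attempts that are older than the window.
            while ts - recent[0] > window:
                recent.popleft()
            max_count = max(max_count, len(recent))
        if max_count >= threshold:
            flagged[user] = max_count
    return flagged


# Hypothetical sample data: alice fails twice, hours apart;
# bob fails five times within two minutes.
t0 = datetime(2024, 1, 1, 9, 0)
failed_logins = [("alice", t0), ("alice", t0 + timedelta(hours=6))]
failed_logins += [("bob", t0 + timedelta(seconds=30 * i)) for i in range(5)]

flagged = find_bursts(failed_logins)
print(flagged)
```

Each of bob's rows looks the same as alice's when examined alone; it is only the count within a short window that stands out. A realistic test would also compare each count against what is typical for that user or for the population, rather than relying on a fixed threshold.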

12.1 Purchases data

12.2 Preparing the data

12.3 Testing for duplicates

12.4 Testing for gaps

12.5 Testing for missing combinations

12.6 Creating new tables to capture collective outliers

12.6.1 Aggregating by entity

12.6.2 Aggregating by two or more entity types

12.6.3 Aggregating by time

12.6.4 Aggregating by entity and time

12.6.5 Merging in additional information

12.7 Identifying trends

12.8 Unusual distributions

12.9 Rolling windows features

12.10 Tests for unusual numbers of point anomalies

Summary