This chapter covers
- Handling null values
- Working with email addresses, phone numbers, and other special data types
- Working with dates and text data
- Encoding categorical data
- Binning and scaling numeric data
- Distance metrics for numeric and categorical data
Different types of data require handling in different ways. The data we’ve looked at so far has been primarily numeric or categorical, but much real-world data is of other types, such as dates or text, and often of quite specific types, such as phone numbers, addresses, URLs, and IP addresses. These provide opportunities for identifying outliers that tend not to exist with categorical or numeric data, but to take advantage of them, we need to process them into a format that outlier detectors can work with, which usually means converting them to numeric or categorical features.
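As a rough illustration of what this conversion can look like, the following sketch uses only the Python standard library to turn a date, an email address, and an IP address into numeric and categorical features. The function name extract_features, the choice of features, and the sample values are assumptions made for this example, not a prescribed method.

```python
from datetime import date
import ipaddress

def extract_features(signup_date: str, email: str, ip: str) -> dict:
    """Hypothetical helper: derive simple features from special data types."""
    d = date.fromisoformat(signup_date)        # expects YYYY-MM-DD
    domain = email.rsplit("@", 1)[-1].lower()  # text after the last "@"
    addr = ipaddress.ip_address(ip)            # raises ValueError if malformed
    return {
        "day_of_week": d.weekday(),        # numeric: 0 = Monday ... 6 = Sunday
        "month": d.month,                  # numeric: 1 to 12
        "email_domain": domain,            # categorical
        "ip_is_private": addr.is_private,  # binary
    }

# Made-up sample values, for illustration only
features = extract_features("2024-03-15", "Jane.Doe@Example.com", "192.168.1.10")
print(features)
```

Each derived feature can then be passed to a detector that expects numeric or categorical input; for example, an unusually rare email domain or an unexpected day of the week may be worth flagging.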
Once all the features are in these formats, we need to decide how best to further preprocess the data, which usually consists of categorical encoding, binning numeric fields, and scaling. We’ve seen examples of these previously. However, determining the most appropriate preprocessing method is often tricky, so we take a closer look at these here.
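To make the three preprocessing steps concrete, here is a minimal sketch in plain Python of one-hot encoding, equal-width binning, and min-max scaling. These are only one common option for each step, the helper names are hypothetical, and the sample data is made up; in practice, a library such as scikit-learn or pandas would typically be used.

```python
def one_hot(values):
    """One-hot encode a list of categories (columns in sorted order)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / n_bins) or 1.0  # avoid dividing by zero if all equal
    # The maximum value would land in bin n_bins, so clamp it to the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all equal
    return [(v - lo) / span for v in values]

# Made-up sample data, for illustration only
colors = ["red", "blue", "red", "green"]
ages = [20, 35, 50, 80]

print(one_hot(colors))            # columns: blue, green, red
print(equal_width_bins(ages, 3))
print(min_max_scale(ages))
```

Even in this small example, there are choices to make: the number of bins, whether bins have equal width or equal counts, and the scaling method can all change which records appear unusual.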