chapter nine

9 Working with specific data types

This chapter covers

Handling null values
Working with email addresses, phone numbers, and other special data types
Working with dates and text data
Encoding categorical data
Binning and scaling numeric data
Distance metrics for numeric and categorical data

Different types of data need to be handled in different ways. The data we’ve looked at so far has been primarily numeric or categorical, but much real-world data is of other types, such as dates or text, or of quite specific types, such as phone numbers, addresses, URLs, and IP addresses. These provide opportunities for identifying outliers that don’t tend to exist with categorical or numeric data. To work with them, though, we need to process them into a format that outlier detectors can handle, which usually means converting them to numeric or categorical features.
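As a rough illustration of what converting a special data type to numeric or categorical features can look like, the sketch below derives a few simple features from a date and from an email address using only the Python standard library. The function names and the particular features chosen are assumptions made for this example, not a prescribed set.

```python
from datetime import date


def date_features(d: date) -> dict:
    """Derive simple numeric/boolean features from a date (illustrative only)."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),      # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,  # Saturday or Sunday
    }


def email_features(addr: str) -> dict:
    """Derive simple categorical/numeric features from an email address."""
    has_at = "@" in addr
    # Split on the last "@"; a malformed address is treated as all local part.
    local, domain = addr.rsplit("@", 1) if has_at else (addr, "")
    return {
        "has_at": has_at,
        "domain": domain.lower(),        # categorical feature
        "local_length": len(local),      # numeric feature
        "local_digit_count": sum(ch.isdigit() for ch in local),
    }


d_feats = date_features(date(2024, 3, 9))
e_feats = email_features("jsmith42@Example.com")
```

Once values are in this form, an unusual day of the week, a rare domain, or an unusually long local part can be assessed in the same way as any other numeric or categorical feature.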

Once all the features are in these formats, we need to decide how best to preprocess the data further, which usually involves encoding categorical features, binning numeric fields, and scaling. We’ve seen examples of these previously. However, determining the appropriate preprocessing method is often tricky, so we take a closer look at these decisions here.
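To make the three preprocessing steps concrete, here is a minimal sketch of one-hot encoding, min-max scaling, and equal-width binning in plain Python. The helper names are hypothetical and chosen for this example; in practice these steps are usually done with a library, and other variants of each step (different encodings, scalers, or binning strategies) may be more suitable for a given dataset.

```python
def one_hot(values):
    """One-hot encode a list of categorical values (columns in sorted order)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def min_max_scale(values):
    """Scale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; map them all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-indexed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


encoded = one_hot(["red", "blue", "red"])
scaled = min_max_scale([10, 20, 30])
binned = equal_width_bins([0, 5, 10], 2)
```

Even in this small sketch there are choices to make, such as the number of bins or how to treat a constant column, and these are the kinds of decisions that can affect which records an outlier detector flags.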

9.1 Null values

9.2 Special data types

9.2.1 Phone numbers

9.2.2 Addresses

9.2.3 Email addresses

9.2.4 ID/Code values

9.2.5 Dates

9.2.6 High-cardinality categorical columns

9.3 Text features

9.3.1 Extracting NLP features

9.3.2 Topic modeling

9.3.3 Clustering text values

9.4 Encoding categorical data

9.4.1 One-hot encoding

9.4.2 Ordinal encoding

9.4.3 Count encoding

9.5 Scaling numeric values

9.6 Binning numeric data

9.7 Distance metrics

9.7.1 Distance metrics for numeric data

9.7.2 Gower’s distance metric for mixed data

9.7.3 Distance metrics for categorical data