9 Working with specific data types

 

This chapter covers

  • Handling null values
  • Working with email addresses, phone numbers, and other special data types
  • Working with dates and text data
  • Encoding categorical data
  • Binning and scaling numeric data
  • Distance metrics for numeric and categorical data

Different types of data require different handling. The data we’ve looked at so far has been primarily numeric or categorical, but much real-world data is of other types, such as dates or text, or of quite specific types, such as phone numbers, addresses, URLs, and IP addresses. These provide opportunities for identifying outliers that don’t tend to exist with categorical or numeric data. To work with them, though, we need to process them into a format that outlier detectors can handle, which usually means converting them to numeric or categorical features.
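As a minimal sketch of this kind of conversion, the following uses only Python's standard library to expand a date and an IP address into numeric and binary features. The field names (`signup_date`, `client_ip`) and the particular features chosen are assumptions made for illustration, not a prescribed set.

```python
from datetime import date
import ipaddress


def date_features(value: str) -> dict:
    """Expand an ISO date string (YYYY-MM-DD) into simple numeric features."""
    d = date.fromisoformat(value)
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": int(d.weekday() >= 5),
    }


def ip_features(value: str) -> dict:
    """Expand an IP address string into a version number and a binary flag."""
    ip = ipaddress.ip_address(value)
    return {
        "ip_version": ip.version,
        "ip_is_private": int(ip.is_private),
    }


# Hypothetical record with two special-typed fields.
record = {"signup_date": "2024-03-09", "client_ip": "192.168.1.20"}

features = {
    **date_features(record["signup_date"]),
    **ip_features(record["client_ip"]),
}
```

Each derived feature is numeric or binary, so the resulting dictionary can be passed on to the encoding and scaling steps like any other numeric or categorical column.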

Once all the features are in these formats, we need to decide how best to preprocess the data further, which usually consists of categorical encoding, binning numeric fields, and scaling. We’ve seen examples of these previously. However, in most cases it can be tricky to determine the appropriate preprocessing method, so we take a closer look at these here.
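To make these preprocessing steps concrete, here is a small pure-Python sketch of one-hot encoding, count encoding, equal-width binning, and min-max scaling. The helper names and the sample values are hypothetical, and equal-width binning and min-max scaling are just one choice among several for each step.

```python
from collections import Counter


def one_hot(values):
    """One binary column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def count_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / n_bins) or 1.0  # avoid dividing by zero for constant columns
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
    return [(v - lo) / span for v in values]


# Hypothetical sample columns.
colors = ["red", "blue", "red", "green"]
amounts = [10.0, 20.0, 30.0, 110.0]

encoded = one_hot(colors)              # columns: blue, green, red
counts = count_encode(colors)
bins = equal_width_bins(amounts, 4)
scaled = min_max_scale(amounts)
```

In this sample, the single large amount pushes the other three values into the lowest bin and into the bottom fifth of the scaled range, which illustrates why the choice of binning and scaling method matters for outlier detection.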

9.1 Null values

9.2 Special data types

9.2.1 Phone numbers

9.2.2 Addresses

9.2.3 Email addresses

9.2.4 ID/Code values

9.2.5 Dates

9.2.6 High-cardinality categorical columns

9.3 Text features

9.3.1 Extracting NLP features

9.3.2 Topic modeling

9.3.3 Clustering text values

9.4 Encoding categorical data

9.4.1 One-hot encoding

9.4.2 Ordinal encoding

9.4.3 Count encoding

9.5 Scaling numeric values

9.6 Binning numeric data

9.7 Distance metrics

9.7.1 Distance metrics for numeric data

9.7.2 Gower’s distance metric for mixed data

9.7.3 Distance metrics for categorical data