
8 Learning with categorical features

 

This chapter covers

  • Introducing categorical features in machine learning
  • Preprocessing categorical features using supervised and unsupervised encoding
  • Understanding ordered boosting
  • Using CatBoost for categorical variables
  • Handling high-cardinality categorical features

Data sets for supervised machine learning consist of features that describe objects and labels that describe the targets we’re interested in modeling. At a high level, features, also known as attributes or variables, are usually classified into two types: continuous and categorical.

A categorical feature is one that takes a discrete value from a finite set of nonnumeric values, called categories. Categorical features are ubiquitous and appear in nearly every data set and domain. For example, blood type (A, B, AB, O), country of residence, and level of education are all categorical features.
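To make the distinction between continuous and categorical features concrete, the following minimal sketch (not from the chapter; the data frame and column names are illustrative) builds a toy data set with one continuous feature and one categorical feature, using the pandas "category" dtype to represent the finite set of categories explicitly.

    import pandas as pd

    # Toy data set with one continuous feature (age), one categorical
    # feature (blood_type), and a binary label. Names are illustrative.
    df = pd.DataFrame({
        "age": [34, 51, 29, 42],              # continuous feature
        "blood_type": ["A", "O", "AB", "O"],  # categorical feature
        "label": [0, 1, 0, 1],                # target
    })

    # pandas can represent a categorical feature with the "category"
    # dtype, which stores the finite set of categories alongside the data.
    df["blood_type"] = df["blood_type"].astype("category")
    print(df["blood_type"].cat.categories)
    # Index(['A', 'AB', 'O'], dtype='object')

Because categories such as "A" or "O" have no inherent numeric meaning, most learning algorithms cannot consume them directly; they must first be encoded into numbers, which is the subject of the sections that follow.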

8.1 Encoding categorical features

8.1.1 Types of categorical features

8.1.2 Ordinal and one-hot encoding

8.1.3 Encoding with target statistics

8.1.4 The category_encoders package

8.2 CatBoost: A framework for ordered boosting

8.2.1 Ordered target statistics and ordered boosting

8.2.2 Oblivious decision trees

8.2.3 CatBoost in practice

8.3 Case study: Income prediction

8.3.1 Adult Data Set

8.3.2 Creating preprocessing and modeling pipelines

8.3.3 Category encoding and ensembling

8.3.4 Ordered encoding and boosting with CatBoost

8.4 Encoding high-cardinality string features