chapter eight

8 Learning with Categorical Features

 

This chapter covers

  • An introduction to categorical features in machine learning
  • Preprocessing categorical features using supervised and unsupervised encoding
  • Understanding how ordered boosting works
  • Introducing CatBoost: a powerful ordered boosting framework for categorical variables
  • Handling high-cardinality categorical features

Data sets for supervised machine learning consist of features that describe objects and labels that describe the targets we are interested in modeling. At a high level, features (also known as attributes or variables) are usually classified into two types: continuous and categorical.
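To make the distinction concrete, here is a minimal sketch that splits the columns of a small table into the two types. The data frame, its column names, and its values are made up for illustration and are not taken from any data set used in this chapter. The split relies on pandas dtypes, which is only a rough heuristic: an integer-coded column (a ZIP code, say) would be treated as continuous here even though it is really categorical.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "height_cm": [12.5, 30.0, 8.2, 15.1],
    "weight_kg": [1.2, 4.8, 0.6, 2.0],
    "color": ["red", "green", "red", "blue"],
    "shape": ["round", "square", "round", "round"],
})

# Heuristic: columns with a numeric dtype are treated as continuous features.
continuous_cols = df.select_dtypes(include="number").columns.tolist()

# Heuristic: all remaining (non-numeric) columns are treated as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# The distinct values observed in each categorical column.
categories = {col: sorted(df[col].unique()) for col in categorical_cols}

print("continuous: ", continuous_cols)
print("categorical:", categorical_cols)
print("categories: ", categories)
```

In practice, the dtype-based split is a starting point that should be checked against what each column actually means.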

A categorical feature takes one value from a finite set of discrete, non-numeric values called categories. Categorical features are ubiquitous: they appear in nearly every data set and in nearly every domain. For example,

8.2 Encoding Categorical Features

8.2.1 Types of Categorical Features

8.2.2 Ordinal and One-Hot Encoding

8.2.3 Encoding with Target Statistics

8.2.4 The category_encoders Package

8.3 CatBoost: A Framework for Ordered Boosting

8.3.1 Ordered Target Statistics and Ordered Boosting

8.3.2 Oblivious Decision Trees

8.3.3 CatBoost in Practice

8.4 Case Study: Income Prediction

8.4.1 The Adult Census Data Set

8.4.2 Creating Preprocessing and Modeling Pipelines

8.4.3 Category Encoding and Ensembling

8.4.4 Ordered Encoding and Boosting with CatBoost

8.5 Encoding High-Cardinality String Features

8.6 Summary