
8 Learning with categorical features

 

This chapter covers

  • Introducing categorical features in machine learning
  • Preprocessing categorical features using supervised and unsupervised encoding
  • Understanding ordered boosting
  • Using CatBoost for categorical variables
  • Handling high-cardinality categorical features

Data sets for supervised machine learning consist of features that describe objects and labels that describe the targets we’re interested in modeling. At a high level, features, also known as attributes or variables, are usually classified into two types: continuous and categorical.

A categorical feature is one that takes a discrete value from a finite set of nonnumeric values, called categories. Categorical features are ubiquitous and appear in nearly every data set and domain. For example, blood type (A, B, AB, O), country of residence, and level of education are all categorical features.
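To make the distinction between continuous and categorical features concrete, the following minimal sketch (not from the chapter; the data frame and column names are illustrative) builds a toy data set with one continuous feature and one categorical feature, using the pandas "category" dtype to represent the finite set of categories explicitly.

    import pandas as pd

    # Toy data set with one continuous feature (age), one categorical
    # feature (blood_type), and a binary label. Names are illustrative.
    df = pd.DataFrame({
        "age": [34, 51, 29, 42],              # continuous feature
        "blood_type": ["A", "O", "AB", "O"],  # categorical feature
        "label": [0, 1, 0, 1],                # target
    })

    # pandas can represent a categorical feature with the "category"
    # dtype, which stores the finite set of categories alongside the data.
    df["blood_type"] = df["blood_type"].astype("category")
    print(df["blood_type"].cat.categories)
    # Index(['A', 'AB', 'O'], dtype='object')

Because categories such as "A" or "O" have no inherent numeric meaning, most learning algorithms cannot consume them directly; they must first be encoded into numbers, which is the subject of the sections that follow.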

8.1 Encoding categorical features

8.1.1 Types of categorical features

8.1.2 Ordinal and one-hot encoding

8.1.3 Encoding with target statistics

8.1.4 The category_encoders package

8.2 CatBoost: A framework for ordered boosting

8.2.1 Ordered target statistics and ordered boosting

8.2.2 Oblivious decision trees

8.2.3 CatBoost in practice

8.3 Case study: Income prediction

8.3.1 Adult Data Set

8.3.2 Creating preprocessing and modeling pipelines

8.3.3 Category encoding and ensembling

8.3.4 Ordered encoding and boosting with CatBoost

8.4 Encoding high-cardinality string features