Categorical features in machine learning

This is an excerpt from Manning's book Real-World Machine Learning.
The most common type of non-numerical feature is the categorical feature. A feature is categorical if its values can be placed in buckets and the order of those values isn't important. In some cases this type of feature is easy to identify (for example, when it takes on only a few string values, such as spam and ham). In other cases, whether a feature is numerical (integer) or categorical isn't so obvious. Sometimes either is a valid representation, and the choice can affect the performance of the model. An example is a feature representing the day of the week, which could validly be encoded either as numerical (the number of days since Sunday) or as categorical (the names Monday, Tuesday, and so forth); a small sketch of both representations follows figure 2.4. You aren't going to look at model building and performance until chapters 3 and 4, but this section introduces a technique for dealing with categorical features. Figure 2.4 points out categorical features in a few datasets.
Figure 2.4. Identifying categorical features. At the top is the simple Person dataset, which has a Marital Status categorical feature. At the bottom is a dataset with information about Titanic passengers. The features identified as categorical here are Survived (whether or not the passenger survived), Pclass (the class in which the passenger was traveling), Gender (male or female), and Embarked (the city from which the passenger embarked).
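To make the day-of-week example above concrete, here is a tiny sketch of both representations (the labels, observed values, and mapping are illustrative, not taken from the book's datasets):

```python
import numpy as np

# Numerical representation: the number of days since Sunday
# (this imposes an order on the values)
day_to_number = {"Sunday": 0, "Monday": 1, "Tuesday": 2, "Wednesday": 3,
                 "Thursday": 4, "Friday": 5, "Saturday": 6}

observed = ["Monday", "Friday", "Monday", "Sunday"]
numeric = np.array([day_to_number[d] for d in observed])  # array([1, 5, 1, 0])

# Categorical representation: keep the raw labels; no order is implied
categorical = np.array(observed)  # array(['Monday', 'Friday', 'Monday', 'Sunday'])
```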
Some machine-learning algorithms can use categorical features natively, but most require data in numerical form. You can encode categorical features as numbers (one number per category), but you can't use this encoding as a true categorical feature, because doing so introduces an (arbitrary) order of categories. Recall that one of the properties of categorical features is that they aren't ordered. Instead, you can convert each of the categories into a separate binary feature that takes the value 1 for instances where the category appears and 0 where it doesn't. Hence, each categorical feature is converted to a set of binary features, one per category. Features constructed in this way are sometimes called dummy variables. Figure 2.5 illustrates this concept further.
The pseudocode for converting the categorical features in figure 2.5 to binary features looks like the following listing. Note that categories is a NumPy array (www.numpy.org), so the elementwise comparison (data == cat) yields an array of Boolean values.
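A minimal runnable sketch of such a conversion, assuming the input is a one-dimensional NumPy array of labels (the helper name cat_to_num and the return format are assumptions for illustration, not necessarily the book's exact listing):

```python
import numpy as np

def cat_to_num(data):
    """Convert a 1-D array of category labels into one binary
    (0/1) feature per distinct category."""
    categories = np.unique(data)       # the distinct category values, sorted
    features = []
    for cat in categories:
        binary = (data == cat)         # Boolean array: True where data equals cat
        features.append(binary.astype(int))
    return features

# Example: the Marital Status feature from the Person dataset
status = np.array(["married", "single", "single", "married"])
print(cat_to_num(status))
# [array([1, 0, 0, 1]), array([0, 1, 1, 0])]
```

Each returned array is one dummy variable: the first marks the instances that are married, the second those that are single.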
In chapter 2, you learned how to work with categorical features. Some ML algorithms work with categorical features directly, but here you'll use the common trick of "Booleanizing" the categorical features: creating a column with value 0 or 1 for each of the feature's possible categories. This makes it possible for any ML algorithm to handle categorical data without changes to the algorithm itself.
The code for converting all of the categorical features is shown in the following listing.
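As a minimal sketch of one way to do this, assuming the data sits in a pandas DataFrame and using pandas.get_dummies (the column names mirror the Titanic features from figure 2.4; the rows are made up):

```python
import pandas as pd

# Illustrative rows with a mix of numerical and categorical columns
df = pd.DataFrame({
    "Age":      [22, 38, 26],
    "Pclass":   [3, 1, 3],
    "Gender":   ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One 0/1 column per category, for each listed categorical feature
df_binary = pd.get_dummies(df, columns=["Pclass", "Gender", "Embarked"], dtype=int)
print(list(df_binary.columns))
# ['Age', 'Pclass_1', 'Pclass_3', 'Gender_female', 'Gender_male',
#  'Embarked_C', 'Embarked_S']
```

The resulting 0/1 columns can be fed to any model unchanged, such as the random forest evaluated in figure 6.10.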
Figure 6.10. The ROC curve and feature-importance list of the random forest model with all categorical variables converted to Boolean (0/1) columns, one per category per feature. The new features bring useful information to the table, as shown by the increase in AUC over the previous model, which lacked the categorical features.