6 Leveraging the best practices for machine learning with tabular data

 

This chapter covers

  • Processing features with more advanced methods
  • Selecting useful features for lighter, more understandable models
  • Optimizing hyperparameters to get the best performance from your models
  • Mastering the specific characteristics and options of GBDTs

In the previous chapter, we discussed decision trees, their characteristics and limitations, and their ensemble models, both those based on random resampling, such as Random Forests, and those based on boosting, such as Gradient Boosting. Since boosting solutions are the current state of the art in tabular data modeling, we explained at length how boosting works and how to optimize its predictions. In particular, we presented two solid gradient boosting implementations, XGBoost and LightGBM, which are proving to be the best available solutions for a data scientist dealing with tabular data.
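To make that recap concrete before diving into the chapter, the following minimal sketch (not taken from the book's own listings) fits both implementations on a stand-in tabular regression problem. The choice of the California Housing dataset from scikit-learn and the hyperparameter values shown are illustrative assumptions, not recommendations.

# A minimal sketch: fitting the two GBDT implementations named above
# on a simple tabular regression dataset. Assumes scikit-learn, xgboost,
# and lightgbm are installed; dataset and settings are illustrative only.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

# Load a tabular regression dataset and hold out a test split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit both GBDT implementations with mostly default settings
xgb_model = xgb.XGBRegressor(
    n_estimators=500, learning_rate=0.05, random_state=0)
xgb_model.fit(X_train, y_train)

lgb_model = lgb.LGBMRegressor(
    n_estimators=500, learning_rate=0.05, random_state=0)
lgb_model.fit(X_train, y_train)

# Compare held-out error for the two models
for name, model in [("XGBoost", xgb_model), ("LightGBM", lgb_model)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} test MSE: {mse:.3f}")

Both libraries expose a scikit-learn-compatible estimator interface, which is why the two models can be trained and evaluated with identical code; the sections that follow build on this shared interface when processing features, selecting them, and tuning hyperparameters.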

6.1 Processing Features

 
 

6.1.1 Multivariate missing data imputation

 
 

6.1.2 Handling missing data with GBDTs

 
 
 
 

6.1.3 Target encoding

 
 

6.1.4 Transforming numerical data

 
 

6.2 Selecting Features

 

6.2.1 Stability Selection for linear models

 
 
 

6.2.2 Shadow Features and Boruta

 
 
 

6.2.3 Forward and backward selection

 

6.3 Optimizing Hyperparameters

 
 

6.3.1 Searching systematically

 
 

6.3.2 Leveraging random trials

 

6.3.3 Reducing the computational burden

 
 
 

6.3.4 Extending your search with Bayesian methods

 
 
 
 

6.3.5 Manually setting hyperparameters

 
 
 
 

6.4 Mastering Gradient Boosting

 
 

6.4.1 Deciding between XGBoost and LightGBM

 
 
 

6.4.2 Exploring tree structures

 
 
 
 

6.4.3 Speeding up GBDTs by compiling

 
 
 

6.5 Summary

 