6 Leveraging the best practices for machine learning with tabular data

 

This chapter covers

  • Processing features with more advanced methods
  • Selecting useful features for lighter, more understandable models
  • Optimizing hyperparameters to get the best performance from your models
  • Mastering the specific characteristics and options of GBDTs

In the previous chapter, we discussed decision trees, their characteristics and limitations, and their ensemble models, both those based on random resampling, such as Random Forests, and those based on boosting, such as Gradient Boosting. Since boosting solutions are the current state of the art in tabular data modeling, we explained at length how boosting works and how to optimize its predictions. In particular, we presented two solid gradient boosting implementations, XGBoost and LightGBM, which are proving to be the best available solutions for a data scientist dealing with tabular data.
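To make that recap concrete before diving into the chapter, the following minimal sketch (not taken from the book's own listings) fits both implementations on a stand-in tabular regression problem. The choice of the California Housing dataset from scikit-learn and the hyperparameter values shown are illustrative assumptions, not recommendations.

# A minimal sketch: fitting the two GBDT implementations named above
# on a simple tabular regression dataset. Assumes scikit-learn, xgboost,
# and lightgbm are installed; dataset and settings are illustrative only.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

# Load a tabular regression dataset and hold out a test split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit both GBDT implementations with mostly default settings
xgb_model = xgb.XGBRegressor(
    n_estimators=500, learning_rate=0.05, random_state=0)
xgb_model.fit(X_train, y_train)

lgb_model = lgb.LGBMRegressor(
    n_estimators=500, learning_rate=0.05, random_state=0)
lgb_model.fit(X_train, y_train)

# Compare held-out error for the two models
for name, model in [("XGBoost", xgb_model), ("LightGBM", lgb_model)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} test MSE: {mse:.3f}")

Both libraries expose a scikit-learn-compatible estimator interface, which is why the two models can be trained and evaluated with identical code; the sections that follow build on this shared interface when processing features, selecting them, and tuning hyperparameters.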

6.1 Processing Features

 
 

6.1.1 Multivariate missing data imputation

 
 

6.1.2 Handling missing data with GBDTs

 
 
 
 

6.1.3 Target encoding

 
 

6.1.4 Transforming numerical data

 
 

6.2 Selecting Features

 

6.2.1 Stability Selection for linear models

 
 
 

6.2.2 Shadow Features and Boruta

 
 
 

6.2.3 Forward and backward selection

 

6.3 Optimizing Hyperparameters

 
 

6.3.1 Searching systematically

 
 

6.3.2 Leveraging random trials

 

6.3.3 Reducing the computational burden

 
 
 

6.3.4 Extending your search with Bayesian methods

 
 
 
 

6.3.5 Manually setting hyperparameters

 
 
 
 

6.4 Mastering Gradient Boosting

 
 

6.4.1 Deciding between XGBoost and LightGBM

 
 
 

6.4.2 Exploring tree structures

 
 
 
 

6.4.3 Speeding up GBDTs by compiling

 
 
 

6.5 Summary

 