13 Putting it all in practice - A real-life example of data engineering and machine learning

This chapter covers

Cleaning up and preprocessing data to make it readable by our model.
Using sklearn to train and evaluate several models.
Using grid search to select good hyperparameters for our model.
Using k-fold cross-validation so that every part of our data is used for both training and validation.

Throughout this book, you’ve learned some of the most important algorithms in supervised learning, and you’ve had the chance to code them and use them to make predictions on several datasets. However, training a model on real data requires several more steps, and those steps are what I show you in this chapter.
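To give a feel for how these steps fit together, here is a minimal sketch of the whole workflow in pandas and sklearn. It is only an illustration under stated assumptions: the toy data, the column names (age, fare, port, label), the choice of a decision tree, and the hyperparameter grid are all hypothetical stand-ins, not the Titanic dataset or the exact code developed in the sections that follow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for a real dataset (NOT the Titanic data).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(1, 80, size=n).astype(float),
    "fare": rng.uniform(5, 100, size=n),
    "port": rng.choice(["C", "Q", "S"], size=n),
})
data["label"] = (data["fare"] > 50).astype(int)

# Inject some missing values so there is something to clean up.
missing_rows = rng.choice(n, size=20, replace=False)
data.loc[missing_rows, "age"] = np.nan

# Step 1: clean up. Fill missing ages with the median instead of dropping the column.
data["age"] = data["age"].fillna(data["age"].median())

# Step 2: feature engineering. One-hot encode the categorical column.
data = pd.get_dummies(data, columns=["port"])

# Step 3: split into features and labels, then hold out a test set.
features = data.drop(columns=["label"])
labels = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Step 4: grid search over a small (assumed) hyperparameter grid,
# scoring each combination with 4-fold cross-validation on the training data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]},
    cv=4,
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Accuracy on the held-out test set:", search.score(X_test, y_test))
```

Notice that the test set is set aside before the grid search and is used only once, at the very end. The cross-validation folds are carved out of the training data, so every training point is used for validation in one fold and for training in the others.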

13.1 The dataset that we’ll use throughout this chapter: The Titanic dataset

13.1.1 The features of our dataset

13.1.2 Using pandas to load the dataset

13.1.3 Using pandas to study our dataset

13.2 Cleaning up our dataset - missing values and how to deal with them

13.2.1 Dropping columns with missing data

13.2.2 How to not lose the entire column - filling in missing data

13.3 Feature engineering - Transforming the features in our dataset before training the models

13.3.1 Turning categorical data into numerical data - One-hot encoding

13.3.2 Turning numerical data into categorical data (and why would we want to do this?) - Binning

13.3.3 Feature selection - Getting rid of unnecessary features

13.4 Training our models

13.4.1 Splitting the data into features and labels, and training and validation

13.4.2 Training several models on our dataset

13.4.3 Which model is better? - Evaluating the models

13.4.4 Testing the model

13.5 Tuning the hyperparameters to find the best model - Grid search

13.6 Using k-fold cross-validation to reuse our data as training and validation

13.7 Summary

13.8 Exercises