chapter thirteen

13 Robust machine learning with ML Pipelines

This chapter covers

Using transformers and estimators to prepare data into ML features
Assembling features into a vector through an ML pipeline
Training a simple ML model
Evaluating a model using relevant performance metrics
Optimizing a model using cross-validation
Interpreting a model’s decision-making process through feature weights

In the previous chapter, we set the stage for machine learning: from a raw data set, we tamed the data and crafted features based on our exploration and analysis of the data. Looking back at the data transformation steps from chapter 12, we performed the following work, resulting in a data frame named food_features.

Read a CSV file containing dish names and multiple columns as feature candidates.
Sanitized the column names (lowercased them and fixed the punctuation, spacing, and non-printable characters).
Removed illogical and irrelevant records.
Filled the null values of binary columns with 0.0.
Capped the amounts for calories, protein, fat, and sodium at the 99th percentile.
Created ratio features (number of calories from a macro over number of calories for the dish).
Imputed the missing values of continuous features with their mean.
Scaled continuous features between 0.0 and 1.0.
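As a reminder of what the capping, imputing, and scaling steps do to a column, here is a minimal plain-Python sketch of the arithmetic. It is not the PySpark code from chapter 12: the helper names (cap, impute_mean, min_max_scale), the sample calories values, and the ceiling of 500.0 (standing in for a computed 99th percentile) are all made up for illustration.

```python
from statistics import mean


def cap(values, ceiling):
    """Replace any value above `ceiling` with `ceiling`; nulls (None) are left alone."""
    return [None if v is None else min(v, ceiling) for v in values]


def impute_mean(values):
    """Replace nulls (None) with the mean of the non-null values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column: one missing value and one extreme outlier.
calories = [100.0, 300.0, None, 5000.0, 200.0]

capped = cap(calories, ceiling=500.0)  # assumed ceiling, not a real percentile
imputed = impute_mean(capped)
scaled = min_max_scale(imputed)

print(capped)   # [100.0, 300.0, None, 500.0, 200.0]
print(imputed)  # [100.0, 300.0, 275.0, 500.0, 200.0]
print(scaled)   # [0.0, 0.5, 0.4375, 1.0, 0.25]
```

Note that the order matters in this sketch: capping first keeps the outlier from inflating the mean used for imputation, and from compressing every other value toward 0.0 when scaling.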

Tip

If you want to catch up with the code from chapter 12, I included the code leading to food_features in the book’s repository under ./code/Ch12/end_of_chapter.py

13.1 Transformers and estimators: the building blocks of ML in Spark

13.1.1 Data comes in, data comes out: the `Transformer`

13.1.2 Data comes in, transformer comes out: the `Estimator`

13.2 Building a (complete) machine learning pipeline

13.2.1 Assembling the final data set with the vector column type

13.2.2 Training an ML model using a `LogisticRegression` classifier

13.3 Evaluating and optimizing our model

13.3.1 Assessing model accuracy: confusion matrix and evaluator object

13.3.2 True positives vs. false positives: the ROC curve

13.3.3 Optimizing hyper-parameters with cross-validation

13.4 Getting the biggest drivers from our model: extracting the coefficients
