10 A foray into machine learning: logistic regression with PySpark

This chapter covers:

  • how investing in a solid data manipulation foundation makes data preparation a breeze;
  • how to address big data quality problems with PySpark;
  • how to create custom features for your ML model;
  • how to select features for your model depending on your use-case;
  • how to assemble the data to get it ready for model training;
  • how to train and evaluate your ML model.

I get excited doing machine learning, but not for the reasons most people do.

I love getting into a new data set and trying to solve a problem. Each data set sports its own problems and idiosyncrasies, and getting it "ML-ready" is extremely satisfying. Building a model gives purpose to data transformation: you ingest, clean, profile, and torture the data in service of a higher purpose: solving a real-life problem. This chapter takes a "clean-ish" data set and gets it all the way to modeling, leveraging a new corner of PySpark we've yet to explore. Just like in a real project, we will spend some time cleaning (and complaining about) our data before building a (pretty decent) first model.

The exercises in this chapter will be a little different from what we've seen so far in this book: because PySpark provides a very coherent ML API, I'll take advantage of it to let you try different options. I encourage you to crack open the API documentation and try your hand at them. The answer key has your back in any case!

10.1 Reading, exploring and preparing our machine learning data set

10.1.1 Exploring our data and getting our first feature columns

10.1.2 Addressing data mishaps and building our first feature set

10.1.3 Getting our data set ready for assembly: null imputation and casting

10.2 Feature engineering and selection

10.2.1 Weeding out the rare binary occurrence columns

10.2.2 Creating custom features

10.2.3 Removing highly correlated features

10.2.4 Scaling our features

10.2.5 Assembling the final data set with the Vector column type

10.3 Training and evaluating our model

10.3.1 Assessing model accuracy with the Evaluator object

10.3.2 Getting the biggest drivers from our model: extracting the coefficients

10.4 Summary