This chapter covers:
- how investing in a solid data manipulation foundation makes data preparation a breeze;
- how to address big data quality problems with PySpark;
- how to create custom features for your ML model;
- how to select features for your model depending on your use case;
- how to assemble the data to get it ready for model training;
- how to train and evaluate your ML model.
I get excited about doing machine learning, but not for the same reasons most people do.
I love getting into a new data set and trying to solve a problem. Each data set sports its own problems and idiosyncrasies, and I find getting it "ML-ready" extremely satisfying. Building a model gives purpose to data transformation: you ingest, clean, profile, and torture the data to solve a real-life problem. This chapter takes a "clean-ish" data set and gets it all the way to modeling, leveraging a corner of PySpark we've yet to explore. Just like in a real project, we will spend some time cleaning (and complaining about) our data before building a (pretty decent) first model.

The exercises in this chapter will be a little different from what we've seen so far in this book: because PySpark provides a very coherent ML API, I'll take advantage of it to let you try different options. I encourage you to crack open the API documentation and try your hand at them. The answer key has your back in any case!
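To give a sense of how uniform that API feels, here is a minimal sketch of the shape of a `pyspark.ml` pipeline, the kind we will build by the end of this chapter. The column names (`feature_one`, `feature_two`, `label`) and the `train_df` DataFrame are hypothetical placeholders, not the chapter's actual data set.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Every pyspark.ml step follows the same transformer/estimator pattern:
# assemble the feature columns into a single vector column, then feed
# that vector to a classifier.
assembler = VectorAssembler(
    inputCols=["feature_one", "feature_two"],  # hypothetical columns
    outputCol="features",
)
classifier = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, classifier])

# `train_df` stands in for a DataFrame containing the columns above.
model = pipeline.fit(train_df)
predictions = model.transform(train_df)
```

Swapping the classifier for another estimator, or adding more feature-preparation stages, keeps the same fit/transform rhythm; that consistency is what makes it easy to experiment with different options.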