This chapter covers
- How investing in a solid data manipulation foundation makes data preparation a breeze
- Addressing big data quality problems with PySpark
- Creating custom features for your ML model
- Selecting compelling features for your model
- Using transformers and estimators as part of the feature engineering process
I get excited doing machine learning, but not for the reasons most people do. I love digging into a new data set and trying to solve a problem. Every data set sports its own problems and idiosyncrasies, and getting it "ML ready" is extremely satisfying. Building a model gives purpose to data transformation: you ingest, clean, profile, and torture the data for a higher purpose, solving a real-life problem.

This chapter focuses on the stage of machine learning that matters most for your use case: exploring, understanding, preparing, and giving purpose to your data. More specifically, we prepare a data set by cleaning the data, creating new features (the fields that will be used to train the model in chapter 13), and then selecting a curated set of features based on how promising they look. By the end of the chapter, we will have a clean data set with well-understood features, ready for machine learning.