12 Setting the stage: Preparing features for machine learning


This chapter covers

  • How investing in a solid data manipulation foundation makes data preparation a breeze
  • Addressing big data quality problems with PySpark
  • Creating custom features for your ML model
  • Selecting compelling features for your model
  • Using transformers and estimators as part of the feature engineering process

I get excited doing machine learning, but not for the reasons most people do. I love getting into a new data set and trying to solve a problem. Each data set sports its own problems and idiosyncrasies, and getting it “ML ready” is extremely satisfying. Building a model gives purpose to data transformation: you ingest, clean, profile, and torture the data for a higher purpose — solving a real-life problem.

This chapter focuses on the stage of machine learning that matters most for your use case: exploring, understanding, preparing, and giving purpose to your data. More specifically, we prepare a data set by cleaning the data, creating new features (the fields that will serve as inputs when training the model in chapter 13), and then selecting a curated subset of features based on how promising they look. By the end of the chapter, we will have a clean data set with well-understood features, ready for machine learning.

12.1 Reading, exploring, and preparing our machine learning data set

12.1.1 Standardizing column names using toDF()

12.1.2 Exploring our data and getting our first feature columns

12.1.3 Addressing data mishaps and building our first feature set

12.1.4 Weeding out useless records and imputing binary features

12.1.5 Taking care of extreme values: Cleaning continuous columns

12.1.6 Weeding out the rare binary occurrence columns

12.2 Feature creation and refinement

12.2.1 Creating custom features

12.2.2 Removing highly correlated features

12.3 Feature preparation with transformers and estimators

12.3.1 Imputing continuous features using the Imputer estimator

12.3.2 Scaling our features using the MinMaxScaler estimator