This chapter covers
- Analyzing tabular data to judge which feature engineering techniques will help
- Implementing feature improvement, construction, and selection techniques on tabular data
- Using scikit-learn’s Pipeline and FeatureUnion classes to make reproducible feature engineering pipelines
- Interpreting ML metrics in the context of our problem domain to evaluate our feature engineering pipeline
In our first case study, we will focus on classic feature engineering techniques that apply to virtually any tabular data (data organized in rows and columns), such as value imputation, categorical data dummification, and feature selection via hypothesis testing. Tabular datasets (figure 3.1) are common, and nearly every data scientist will have to work with tabular data at some point in their career. There are many benefits to working with tabular data: