3 Healthcare: Diagnosing COVID-19


This chapter covers

  • Analyzing tabular data to judge which feature engineering techniques are going to help
  • Implementing feature improvement, construction, and selection techniques on tabular data
  • Using scikit-learn’s Pipeline and FeatureUnion classes to make reproducible feature engineering pipelines (a brief sketch of these two classes follows this list)
  • Interpreting ML metrics in the context of our problem domain to evaluate our feature engineering pipeline
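Pipeline and FeatureUnion come up throughout this chapter, so here is a quick orientation before we begin. The sketch below is not the chapter’s pipeline; it uses scikit-learn’s built-in breast cancer dataset, scaling, and PCA purely as stand-ins to show how the two classes fit together: FeatureUnion runs several feature-producing branches side by side and concatenates their outputs, and Pipeline chains that combined feature step with a model into one estimator.

# A minimal sketch of Pipeline + FeatureUnion on a stand-in dataset,
# not the COVID flu diagnostic pipeline we build later in this chapter.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# FeatureUnion fits each branch on the same input and concatenates the outputs column-wise
features = FeatureUnion([
    ("scaled", StandardScaler()),      # standardized copies of the raw columns
    ("pca", Pipeline([                 # a few principal components as extra features
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
    ])),
])

# Pipeline chains the feature step and the model into a single reproducible estimator
pipe = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=2000)),
])

print(cross_val_score(pipe, X, y, cv=5).mean())

Because every transformation lives inside the fitted pipeline object, the exact same feature engineering is re-applied whenever the model scores new data, which is what we mean by a reproducible feature engineering pipeline.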

In our first case study, we will focus on the more classic feature engineering techniques that can be applied to virtually any tabular data (data arranged in rows and columns), such as value imputation, categorical data dummification, and feature selection via hypothesis testing. Tabular datasets (figure 3.1) are common, and nearly every data scientist will have to work with tabular data at some point in their career. There are several benefits to working with tabular data:

  • It is an interpretable format. Rows are observations, and columns are features.
  • Tabular data are easy for most professionals, not just data scientists, to understand. A spreadsheet of rows and columns can be shared with and read by a wide range of people.
Figure 3.1 Tabular data consist of rows (also known as observations or samples) and columns (which we will often refer to as features).
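To make the three technique families just mentioned concrete (value imputation, dummification, and feature selection via hypothesis testing), here is a small sketch on a made-up four-patient table. The column names and values are invented for illustration and are not the COVID flu diagnostic dataset introduced in section 3.1; the sketch wires scikit-learn’s SimpleImputer, OneHotEncoder, and SelectKBest (with an ANOVA F-test as one possible hypothesis test) together with a ColumnTransformer for brevity.

# A sketch of imputation, dummification, and hypothesis-test feature selection
# on an invented toy table, not the chapter's COVID flu data.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Rows are observations (patients); columns are features plus a target
df = pd.DataFrame({
    "temperature": [98.6, np.nan, 101.2, 99.1],  # quantitative, with a missing value
    "cough": ["dry", "wet", np.nan, "dry"],       # qualitative, with a missing value
    "diagnosis": [0, 1, 1, 0],                    # target column
})
X, y = df.drop(columns="diagnosis"), df["diagnosis"]

preprocess = ColumnTransformer([
    # value imputation for the quantitative column
    ("num", SimpleImputer(strategy="mean"), ["temperature"]),
    # impute, then dummify (one-hot encode) the qualitative column
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("dummify", OneHotEncoder(handle_unknown="ignore")),
    ]), ["cough"]),
])

# feature selection via a hypothesis test: keep the 2 features with the strongest F-scores
pipe = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=2)),
])

print(pipe.fit_transform(X, y).shape)  # (4, 2): four patients, two surviving features

The ColumnTransformer is used here only to route each column to the right steps; the chapter itself assembles its pipeline from the Pipeline and FeatureUnion classes listed at the start of the chapter.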

3.1 The COVID flu diagnostic dataset

3.1.1 The problem statement and defining success

3.2 Exploratory data analysis

3.3 Feature improvement

3.3.1 Imputing missing quantitative data

3.3.2 Imputing missing qualitative data

3.4 Feature construction

3.4.1 Numerical feature transformations

3.4.2 Constructing categorical data

3.5 Building our feature engineering pipeline

3.5.1 Train/test splits

3.6 Feature selection
