Chapter 2. Real-world data


This chapter covers

  • Getting started with machine learning
  • Collecting training data
  • Using data-visualization techniques
  • Preparing your data for ML

In supervised machine learning, you use data to teach automated systems how to make accurate decisions. ML algorithms are designed to discover patterns and associations in historical training data; they learn from that data and encode that learning into a model to accurately predict a data attribute of importance for new data. Training data, therefore, is fundamental in the pursuit of machine learning. With high-quality data, subtle nuances and correlations can be accurately captured and high-fidelity predictive systems can be built. But if training data is of poor quality, the efforts of even the best ML algorithms may be rendered useless.

This chapter serves as your guide to collecting and compiling training data for use in the supervised machine-learning workflow (figure 2.1). We give general guidelines for preparing training data for ML modeling and warn of some of the common pitfalls. Much of the art of machine learning is in exploring and visualizing training data to assess data quality and guide the learning process. To that end, we provide an overview of some of the most useful data-visualization techniques. Finally, we discuss how to prepare a training dataset for ML model building, which is the subject of chapter 3.

2.1. Getting started: data collection

2.2. Preprocessing the data for modeling

2.3. Using data visualization

2.4. Summary

2.5. Terms from this chapter



dummy variable A binary feature that indicates that an observation is (or isn’t) a member of a category
ground truth The value of a known target variable or label for a training or test set
missing data Those features with unknown values for a subset of instances
imputation Replacement of the unknown values of missing data with numerical or categorical values
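
To make two of these terms concrete, here is a minimal sketch in plain Python of creating dummy variables and imputing missing values. The function names (dummy_variables, impute_mean), the sample columns, and the choice of the mean as the replacement value are all assumptions made for illustration; they aren't taken from this chapter, and the mean is only one of several reasonable imputation strategies.

```python
from statistics import mean


def dummy_variables(values):
    """Expand a categorical column into one binary (0/1) column per category.

    Hypothetical helper: each output column marks whether an observation
    is (1) or isn't (0) a member of that category.
    """
    categories = sorted({v for v in values if v is not None})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries.

    Hypothetical helper: mean imputation is an assumption chosen for
    simplicity, not the only way to fill in missing data.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up sample data: one categorical feature and one numerical
# feature with a missing value.
gender = ["male", "female", "female", "male"]
age = [22.0, None, 38.0, 30.0]

dummies = dummy_variables(gender)
imputed_age = impute_mean(age)

print(dummies)
print(imputed_age)
```

In this sketch the single categorical column becomes two binary columns, one per category, and the one missing age is replaced by the average of the three observed ages.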