chapter six

6 Gathering datasets

This chapter covers

Data sources
Turning raw data into datasets
Distinguishing data from metadata
Defining how much is enough
Solving the cold start problem
Looking for properties of a healthy data pipeline

In the preceding chapters, we covered the essential steps in preparing to build a machine learning (ML) system: defining the problem space and solution space, identifying risks, and finding the right loss functions and metrics. Now we turn to something your ML project simply won’t take off without: datasets. Just as you need fuel to start your car or a nutritious breakfast to get going before a busy day at work, an ML system needs a dataset to function properly.

An old saying about real estate holds that the three most important things are location, location, and location. Similarly, if we had to choose only three things to focus on while building an ML system, they would be data, data, and data. Another classic saying from the computer science world is “garbage in, garbage out,” and it is hard to dispute.

Here we’ll break down the essentials of working with datasets, from finding and processing data sources to properly cooking your dataset and building data pipelines. To close the chapter, we will look at datasets as a part of design documents, using the examples of Supermegaretail and PhotoStock Inc.

6.1 Data sources

6.2 Cooking the dataset

6.2.1 ETL

6.2.2 Filtering

6.2.3 Feature engineering

6.2.4 Labeling

6.3 Data and metadata

6.4 How much data is enough?

6.5 Chicken-or-egg problem

6.6 Properties of a healthy data pipeline

6.7 Design document: Dataset

6.7.1 Dataset for Supermegaretail

6.7.2 Dataset for PhotoStock Inc.

Summary