6 Gathering datasets
This chapter covers
- Data sources
- Turning raw data into datasets
- Distinguishing data from metadata
- Defining how much is enough
- Solving the cold start problem
- Looking for properties of a healthy data pipeline
In the preceding chapters, we covered the essential steps in preparing to build a machine learning (ML) system: exploring the problem space and solution space, identifying risks, and finding the right loss functions and metrics. Now we turn to something your ML project simply won’t take off without: datasets. They are as vital to an ML system as everyday essentials are to us. Just as you need fuel to start your car or a nutritious breakfast to power you through a busy day at work, an ML system needs a dataset to function properly.
There is an old saying about real estate: the three most important things are location, location, and location. Similarly, if we had to choose only three things to focus on while building an ML system, they would be data, data, and data. Another classic saying from the computer science world is “garbage in, garbage out,” and it is hard to argue with.
Here we’ll break down the essentials of working with datasets, from finding and processing data sources to properly cooking your dataset and building data pipelines. To close the chapter, we will look at datasets as a part of design documents, using the examples of Supermegaretail and PhotoStock Inc.