Chapter 6. Working with data

 

This chapter covers

  • How to use the tf.data API to train models using large datasets
  • How to explore your data to find and fix potential issues
  • How to use data augmentation to create new “pseudo-examples” to improve model quality

The wide availability of large volumes of data is a major factor leading to today’s machine-learning revolution. Without easy access to large amounts of high-quality data, the dramatic rise in machine learning would not have happened. Datasets are now available all over the internet—freely shared on sites like Kaggle and OpenML, among others—as are benchmarks for state-of-the-art performance. Entire branches of machine learning have been propelled forward by the availability of “challenge” datasets, setting a bar and a common benchmark for the community.[1] If machine learning is our generation’s Space Race, then data is clearly our rocket fuel;[2] it’s potent, it’s valuable, it’s volatile, and it’s absolutely critical to a working machine-learning system. Not to mention that polluted data, like tainted fuel, can quickly lead to systemic failure. This chapter is about data. We will cover best practices for organizing data, techniques for detecting and cleaning up problems in it, and ways to use it efficiently.

[1] See how ImageNet propelled the field of object recognition or what the Netflix challenge did for collaborative filtering.

[2] Credit for the analogy goes to Edd Dumbill, “Big Data Is Rocket Fuel,” Big Data, vol. 1, no. 2, pp. 71–72.
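As a preview of the first two topics, here is a minimal sketch of the pattern that sections 6.1 and 6.2 develop in detail: wrapping training data in a tf.data.Dataset and passing it to model.fitDataset(). The toy dataset and one-unit model are hypothetical stand-ins, and the snippet assumes TensorFlow.js is available as tf (for example, via the @tensorflow/tfjs package).

import * as tf from '@tensorflow/tfjs';

// Toy data (hypothetical): learn y = 2 * x - 1 from six points.
const xs = [[0], [1], [2], [3], [4], [5]];
const ys = xs.map(([x]) => [2 * x - 1]);

// Wrap the arrays as a tf.data.Dataset of {xs, ys} elements and batch it.
const dataset = tf.data
  .zip({xs: tf.data.array(xs), ys: tf.data.array(ys)})
  .batch(2);

// A one-unit linear model, just for illustration.
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

// fitDataset() streams batches from the dataset rather than requiring
// the whole training set to sit in memory as one big tensor.
model.fitDataset(dataset, {epochs: 20})
  .then(() => model.predict(tf.tensor2d([[10]])).print());

Because fitDataset() pulls batches from the dataset on demand, the same pattern keeps working when the training data is far too large to fit in memory at once.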

6.1. Using tf.data to manage data

6.2. Training models with model.fitDataset

6.3. Common patterns for accessing data

6.4. Your data is likely flawed: Dealing with problems in your data

6.5. Data augmentation

Exercises

Summary
