chapter two

2 Data ingestion patterns

This chapter covers

Understanding data ingestion and its responsibilities
Handling large datasets in memory by consuming smaller datasets in batches (the batching pattern)
Preprocessing extremely large datasets as smaller chunks on multiple machines (the sharding pattern)
Fetching and re-accessing the same dataset for multiple training rounds (the caching pattern)
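
As a rough orientation before the detailed sections, the three patterns listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used later in the chapter: the names batches, shard, and CachedDataset are hypothetical, the dataset is a small in-memory list of integers standing in for real records, and the "machines" in the sharding example are just worker indices.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Sequence


def batches(data: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Batching: yield consecutive slices so only one batch is handled at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def shard(data: Sequence[int], num_shards: int, index: int) -> Sequence[int]:
    """Sharding: assign every num_shards-th record to the worker numbered `index`."""
    return data[index::num_shards]


class CachedDataset:
    """Caching: load the records once, then reuse them for every training round."""

    def __init__(self, loader: Callable[[], Iterable[int]]) -> None:
        self._loader = loader              # hypothetical (possibly slow) data source
        self._cache: Optional[List[int]] = None
        self.load_count = 0                # how many times the source was actually read

    def records(self) -> List[int]:
        if self._cache is None:            # only the first call reads the source
            self._cache = list(self._loader())
            self.load_count += 1
        return self._cache


records = list(range(10))                  # stand-in for a real dataset

first_batches = list(batches(records, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
worker_one = shard(records, 3, 1)          # [1, 4, 7]

dataset = CachedDataset(lambda: records)
for _ in range(3):                         # three training rounds, one load
    dataset.records()
```

In this sketch the last batch is simply shorter than the others, and the shards are interleaved rather than contiguous; both are design choices made for brevity, and a real pipeline may prefer different ones.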

Chapter 1 discussed the growing scale of modern machine learning applications, such as larger datasets and heavier traffic for model serving. It also covered the complexity and challenges of building distributed systems, particularly distributed systems for machine learning applications. We learned that a distributed machine learning system is usually a pipeline of many components, such as data ingestion, model training, serving, and monitoring, and that established patterns are available for designing each component to handle the scale and complexity of real-world machine learning applications.

All data analysts and scientists should have some exposure to data ingestion, either through hands-on experience building a data ingestion component or simply by using a dataset provided by the engineering team or a customer. Designing a good data ingestion component is nontrivial and requires understanding the characteristics of the dataset we want to use for building a machine learning model. Fortunately, we can follow established patterns to build that model on a reliable and efficient foundation.

2.1 What is data ingestion?

2.2 The Fashion-MNIST dataset

2.3 Batching pattern

2.3.1 The problem: Performing expensive operations for the Fashion-MNIST dataset with limited memory

2.3.2 The solution

2.3.3 Discussion

2.3.4 Exercises

2.4 Sharding pattern: Splitting extremely large datasets among multiple machines

2.4.1 The problem

2.4.2 The solution

2.4.3 Discussion

2.4.4 Exercises