2 Data ingestion patterns

This chapter covers

Understand what data ingestion involves and what it is responsible for.
Handle large datasets in memory by consuming them in small batches with the batching pattern.
Preprocess extremely large datasets as smaller chunks located on multiple machines with the sharding pattern.
Fetch and re-access the same dataset more efficiently for multiple training rounds with the caching pattern.

In the previous chapter, we discussed the growing scale of modern machine learning applications, such as larger datasets and heavier traffic for model serving. We also talked about the complexity and challenges of building distributed systems, particularly distributed systems for machine learning applications. We learned that a distributed machine learning system is usually a pipeline of many components, such as data ingestion, model training, serving, and monitoring, and that there are established patterns for designing each component to handle the scale and complexity of real-world machine learning applications.

Data ingestion is the first step in a machine learning pipeline, and an unavoidable one. All data analysts and scientists should have some level of exposure to data ingestion, whether through hands-on experience building a data ingestion component or simply through using a dataset handed over to them by the engineering team or a customer.
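As a rough preview of the three patterns listed at the start of this chapter (batching, sharding, and caching), the sketch below shows each idea in a few lines of plain Python. It is only an illustration under simplifying assumptions: the helper names `batches`, `shard`, and `CachedDataset` are hypothetical, the "dataset" is a small list of integers standing in for something like Fashion-MNIST, and real systems would typically rely on a data-loading library rather than hand-written helpers.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def batches(data: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Batching: yield the dataset in fixed-size pieces instead of all at once."""
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


def shard(data: Sequence[int], num_shards: int, index: int) -> List[int]:
    """Sharding: give worker `index` every `num_shards`-th record."""
    return list(data[index::num_shards])


class CachedDataset:
    """Caching: pay the loading cost once, then reuse the result on later epochs."""

    def __init__(self, loader: Callable[[], List[int]]) -> None:
        self._loader = loader  # hypothetical expensive loading function
        self._cache: Optional[List[int]] = None
        self.loads = 0  # counts how often the loader actually ran

    def read(self) -> List[int]:
        if self._cache is None:
            self._cache = self._loader()
            self.loads += 1
        return self._cache


records = list(range(10))  # stand-in for a real dataset

batch_list = list(batches(records, batch_size=4))
worker_shards = [shard(records, num_shards=3, index=i) for i in range(3)]

dataset = CachedDataset(lambda: records)
for _epoch in range(3):  # three training rounds over the same data
    dataset.read()
```

In this toy run the ten records are consumed in batches of at most four, split round-robin across three hypothetical workers, and loaded only once across three epochs. The later sections of this chapter discuss when each pattern applies and what trade-offs it brings.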

2.1 What Is Data Ingestion?

2.1.1 The Fashion-MNIST Dataset

2.2 Batching Pattern: Performing Expensive Operations for the Fashion-MNIST Dataset with Limited Memory

2.2.1 Problem

2.2.2 Solution

2.2.3 Discussion

2.2.4 Exercises

2.3 Sharding Pattern: Splitting Extremely Large Datasets among Multiple Machines

2.3.1 Problem

2.3.2 Solution

2.3.3 Discussion

2.3.4 Exercises

2.4 Caching Pattern: Re-accessing Previously Used Data for Efficient Multi-epoch Model Training