This chapter covers
- Understanding data ingestion and its responsibilities
- Handling datasets that are too large to fit in memory by consuming them in smaller batches (the batching pattern)
- Preprocessing extremely large datasets as smaller chunks on multiple machines (the sharding pattern)
- Fetching and re-accessing the same dataset for multiple training rounds (the caching pattern)
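These three patterns come up repeatedly throughout the chapter, and a rough sense of what they look like in code can be helpful before we dig into the details. The sketch below is a minimal preview, not the chapter's reference implementation; it assumes TensorFlow's tf.data API and a small in-memory toy dataset standing in for a much larger one, and the worker count and worker index are hypothetical placeholders that a real cluster setup would supply.

```python
import tensorflow as tf

# Toy in-memory data standing in for a much larger dataset (hypothetical).
features = tf.random.uniform((1000, 8))
labels = tf.random.uniform((1000,), maxval=2, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Sharding pattern: each worker processes only its own slice of the data.
# The worker count (4) and index (0) are placeholders a cluster would supply.
worker_index = 0
dataset = dataset.shard(num_shards=4, index=worker_index)

# Caching pattern: keep the records around after the first pass so later
# training rounds can re-read them without repeating the fetch/preprocess work.
dataset = dataset.cache()

# Batching pattern: consume the data in small fixed-size batches instead of
# loading everything into memory at once.
dataset = dataset.batch(32)

for batch_features, batch_labels in dataset:
    pass  # feed each batch to the training step here
```

Later sections of this chapter look at each of these patterns in more detail.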
Chapter 1 discussed the growing scale of modern machine learning applications, such as larger datasets and heavier traffic for model serving. It also introduced the complexity and challenges of building distributed systems, particularly distributed systems for machine learning applications. We learned that a distributed machine learning system is usually a pipeline of many components, such as data ingestion, model training, serving, and monitoring, and that established patterns are available for designing each component to handle the scale and complexity of real-world machine learning applications.
All data analysts and scientists should have some exposure to data ingestion, whether hands-on experience building a data ingestion component or simply working with a dataset handed off by an engineering team or a customer. Designing a good data ingestion component is nontrivial and requires understanding the characteristics of the dataset we want to use for building a machine learning model. Fortunately, we can follow established patterns to build this component and give the model a reliable and efficient foundation.