After our general discussion of deep learning systems, we turn to the remaining chapters, each of which focuses on a specific component of those systems. We present dataset management first, not only because deep learning projects are data-driven but also because we want to remind you how important it is to think about data management before building other services.
Dataset management (DM) is often overlooked in the deep learning model development process, whereas data processing, model training, and model serving attract the most attention. A common assumption in data engineering is that good data processing pipelines, such as ETL (extract, transform, and load) pipelines, are all we need. But if you neglect to manage your datasets as your project proceeds, your data collection and dataset consumption logic becomes more and more complicated, improving model performance becomes difficult, and eventually the entire project slows down. A good DM system can expedite model development by decoupling training data collection from training data consumption; it also enables model reproducibility by versioning the training data.
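To make the two benefits concrete, here is a minimal sketch of the idea, not a real DM system. The `DatasetStore` class and its `ingest`, `snapshot`, and `fetch` methods are hypothetical names invented for this illustration, and the store lives entirely in memory. Collection code only calls `ingest`, training code only calls `fetch` with a version ID, and a content hash serves as that ID so a training run can be pinned to exactly the data it saw.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class DatasetStore:
    """Hypothetical in-memory dataset manager, for illustration only."""

    _records: list = field(default_factory=list)
    _versions: dict = field(default_factory=dict)

    def ingest(self, record: str) -> None:
        # Collection side: producers append data without knowing who consumes it.
        self._records.append(record)

    def snapshot(self) -> str:
        # Freeze the current records and derive a version ID from their content.
        frozen = tuple(self._records)
        payload = json.dumps(list(frozen)).encode("utf-8")
        version = hashlib.sha256(payload).hexdigest()[:12]
        self._versions[version] = frozen
        return version

    def fetch(self, version: str) -> tuple:
        # Consumption side: training code asks for an immutable version.
        return self._versions[version]


store = DatasetStore()
store.ingest("sample one")
v1 = store.snapshot()

store.ingest("sample two")  # new data arrives after v1 was cut
v2 = store.snapshot()

# A training job pinned to v1 still sees only the data that existed at v1.
training_data = store.fetch(v1)
```

Because a snapshot is an immutable tuple keyed by its content hash, rerunning training against `v1` reads the same records even after more data has been ingested, which is the property that makes a model reproducible. A production system would add persistent storage, metadata, and access control, none of which this sketch attempts.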