chapter two

2 Dataset management service

This chapter covers

Understanding dataset management
Using design principles to build a dataset management service
Building a sample dataset management service
Using open source approaches to dataset management

After our general discussion of deep learning systems, we are ready for the remaining chapters, each of which focuses on a specific component of those systems. We present dataset management first, not only because deep learning projects are data-driven but also because we want to remind you how important it is to think about data management before building other services.

Dataset management (DM) is often overlooked in deep learning model development, while data processing, model training, and model serving attract most of the attention. A common belief in data engineering is that good data processing pipelines, such as ETL (extract, transform, and load) pipelines, are all we need. But if you neglect dataset management as your project proceeds, your data collection and dataset consumption logic becomes more and more complicated, improving model performance becomes difficult, and eventually the entire project slows down. A good DM system can expedite model development by decoupling training data collection from training data consumption; it also enables model reproducibility by versioning the training data.

2.1 Understanding dataset management service

2.1.1 Why deep learning systems need dataset management

2.1.2 Dataset management design principles

2.1.3 The paradoxical character of datasets

2.2 Touring a sample dataset management service

2.2.1 Playing with the sample service

2.2.2 Users, user scenarios, and the big picture

2.2.3 Data ingestion API

2.2.4 Training dataset fetching API

2.2.5 Internal dataset storage

2.2.6 Data schemas

2.2.7 Adding new dataset type (IMAGE_CLASS)

2.2.8 Service design recap

2.3 Open source approaches