13 Data pipeline

This chapter covers

  • Understanding the common types of data formats and storage for training datasets
  • Using the TensorFlow TFRecord format and tf.data for dataset representations and transformations
  • Constructing a data pipeline for feeding a model during training
  • Preprocessing using TF.Keras preprocessing layers, layer subclassing, and TFX components
  • Using data augmentation to train models for translational, scale, and viewport invariance

You’ve built your model, using composable models as needed. You’ve trained and retrained it, tested and retested it. Now you’re ready to launch it, and these last two chapters show you how. More specifically, you’ll migrate a model from the preparation and exploration phases to a production environment, using the TensorFlow 2.x ecosystem in conjunction with TensorFlow Extended (TFX).

In a production environment, operations such as training and deploying are executed as pipelines. Pipelines have the advantage of being configurable, reusable, and version-controlled, and they retain history. Because a production pipeline is so extensive, we need two chapters to cover it. This chapter focuses on the data pipeline components, which make up the front end of a production pipeline. The next chapter covers the training and deployment components.

Let’s start with a diagram, so you can see the process from start to finish. Figure 13.1 shows an overall view of the basic end-to-end (e2e) production pipeline.

13.1 Data formats and storage

13.1.1 Compressed and raw-image formats
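
As a quick illustration of the difference between the two, here is a minimal sketch of decoding a compressed JPEG file into a raw image tensor with tf.io; the file name example.jpg is a placeholder.

import tensorflow as tf

# Read the compressed JPEG bytes from disk (path is a placeholder).
raw_bytes = tf.io.read_file('example.jpg')

# Decode the compressed bytes into a raw uint8 tensor of shape (H, W, 3).
image = tf.io.decode_jpeg(raw_bytes, channels=3)

# Convert to float32 in [0, 1], the typical raw form fed to a model.
image = tf.image.convert_image_dtype(image, tf.float32)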

13.1.2 HDF5 format
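
The sketch below shows one common way to store and read image data in HDF5 using the h5py package; the arrays, file name, and dataset names are all hypothetical.

import h5py
import numpy as np

# Hypothetical arrays standing in for a training set.
images = np.random.rand(100, 128, 128, 3).astype(np.float32)
labels = np.random.randint(0, 10, size=(100,))

# Write the arrays as named datasets in one HDF5 file.
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('images', data=images)
    f.create_dataset('labels', data=labels)

# Read back a slice; h5py pulls only the requested rows from disk.
with h5py.File('train.h5', 'r') as f:
    first_batch = f['images'][:32]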

13.1.3 DICOM format
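
A minimal sketch of reading a DICOM file, assuming the pydicom package; the file name scan.dcm is a placeholder, and compressed pixel data may require an additional decoder plugin.

import pydicom

# Parse the DICOM file (path is a placeholder).
ds = pydicom.dcmread('scan.dcm')

# Standard DICOM metadata is exposed as named attributes.
print(ds.Modality)

# The image itself comes back as a NumPy array.
image = ds.pixel_array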

13.1.4 TFRecord format
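
As a sketch of the format, each record in a TFRecord file is a serialized tf.train.Example protocol buffer; the feature names image and label and the file paths here are assumptions for illustration.

import tensorflow as tf

def make_example(image_bytes, label):
    # Wrap raw image bytes and an integer label as tf.train features.
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Serialize one Example per image into a TFRecord file.
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    image_bytes = tf.io.read_file('example.jpg').numpy()  # placeholder path
    writer.write(make_example(image_bytes, label=0).SerializeToString())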

13.2 Data feeding

13.2.1 NumPy
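
When the dataset fits in memory, the simplest feeding mechanism is passing NumPy arrays directly to fit(). A minimal sketch, with random arrays and a toy model standing in for real data:

import numpy as np
import tensorflow as tf

# Hypothetical in-memory training data.
x_train = np.random.rand(1000, 28, 28, 1).astype(np.float32)
y_train = np.random.randint(0, 10, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# fit() accepts NumPy arrays directly; Keras slices them into batches.
model.fit(x_train, y_train, batch_size=32, epochs=1)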

13.2.2 TFRecord
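
For data that does not fit in memory, a tf.data pipeline can stream records from a TFRecord file. A sketch, assuming records written with the image/label schema from section 13.1.4:

import tensorflow as tf

# The spec must mirror the features written into the file.
feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example['image'], channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [128, 128])  # uniform size so batching works
    return image, example['label']

# Stream, parse, shuffle, batch, and prefetch records.
dataset = (tf.data.TFRecordDataset('train.tfrecord')
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(1024)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))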

13.3 Data preprocessing

13.3.1 Preprocessing with a pre-stem
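
A sketch of the idea: bolt resizing and rescaling layers in front of an already-built model, so raw uint8 images of any size can be fed directly. In recent TF 2.x releases these layers live under tf.keras.layers (older releases place them in tf.keras.layers.experimental.preprocessing); the toy model is hypothetical.

import tensorflow as tf

# Hypothetical model that expects 128x128 float inputs in [0, 1].
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Pre-stem: preprocessing layers placed in front of the model proper,
# so preprocessing travels with the model into deployment.
inputs = tf.keras.layers.Input(shape=(None, None, 3))
x = tf.keras.layers.Resizing(128, 128)(inputs)
x = tf.keras.layers.Rescaling(1.0 / 255)(x)
outputs = model(x)
deployable = tf.keras.Model(inputs, outputs)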

13.3.2 Preprocessing with TF Extended
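
With TFX, preprocessing is specified as a preprocessing_fn that the Transform component runs over the whole dataset. A minimal sketch, assuming a numeric feature named pixels and a label named label in the schema:

import tensorflow_transform as tft

# Entry point invoked by the TFX Transform component.
def preprocessing_fn(inputs):
    return {
        # Full-pass transform: z-score normalization uses the mean and
        # variance computed over the entire dataset, not per batch.
        'pixels_xf': tft.scale_to_z_score(inputs['pixels']),
        'label': inputs['label'],
    }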

13.4 Data augmentation

13.4.1 Invariance
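
A sketch of augmentation for invariance using the Keras random-image layers (under tf.keras.layers in recent TF 2.x releases): each pass over the data sees a randomly shifted, zoomed, flipped, and rotated version of every image, so the model learns features that survive those changes. The random tensors stand in for a real dataset.

import tensorflow as tf

# Random perturbations applied per image, per epoch, during training only.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),     # viewpoint
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # translational invariance
    tf.keras.layers.RandomZoom(0.2),              # scale invariance
    tf.keras.layers.RandomRotation(0.1),          # orientation
])

# Hypothetical stand-in dataset.
images = tf.random.uniform([8, 128, 128, 3])
labels = tf.zeros([8], dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)

# Apply the augmentation inside the input pipeline.
dataset = dataset.map(lambda x, y: (augment(x, training=True), y))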