3 Model training service

This chapter covers

  • Understanding the design principles for building a training service
  • Explaining the deep learning training code pattern
  • Touring a sample training service
  • Using an open source training service, such as Kubeflow
  • Deciding when to use a public cloud training service

Model training in machine learning is not the exclusive responsibility of researchers and data scientists. Their work on the training algorithms is crucial because they define the model architecture and the training plan. But just as physicists need a software system to control an electron-positron collider to test their particle theories, data scientists need an effective software system to manage expensive compute resources, such as GPUs, CPUs, and memory, and to execute the training code. This system, which manages compute resources and executes training code, is known as the model training service.

Building a high-quality model depends not only on the training algorithm but also on the compute resources and the system that executes the training. A good training service can make model training much faster and more reliable, and it can also reduce the average cost of building a model. When the dataset or model architecture is too large for a single machine, using a training service to manage the distributed computation is effectively your only option.

3.1 Model training service: Design overview

3.1.1 Why use a service for model training?

3.1.2 Training service design principles

3.2 Deep learning training code pattern

3.2.1 Model training workflow

3.2.2 Dockerize model training code as a black box

3.3 A sample model training service

3.3.1 Play with the service

3.3.2 Service design overview

3.3.3 Training service API

3.3.4 Launching a new training job

3.3.5 Updating and fetching job status

3.3.6 The intent classification model training code