5 Distributed model training: introduction

 

This chapter covers

  • Understanding and analyzing the training process of neural networks to identify opportunities for parallelism.
  • Using Kubeflow’s PyTorchJob resources and PyTorch’s distributed package (torch.distributed) to accelerate neural network training by splitting the training dataset across several pods (a minimal sketch of this pattern follows the list).
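
To make the second point concrete, here is a minimal sketch of the kind of data-parallel training script that a PyTorchJob would launch in every pod. It assumes that the launcher (the Kubeflow training operator, or torchrun when running locally) sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables; the dataset, model, and hyperparameters below are placeholders chosen only to keep the example self-contained.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # The launcher (PyTorchJob / torchrun) is assumed to provide RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in every pod's environment.
    dist.init_process_group(backend="gloo")  # use "nccl" when every pod has a GPU
    rank = dist.get_rank()

    # Placeholder dataset and model; any torch Dataset / nn.Module works here.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler hands each pod a disjoint shard of the data,
    # which is what "splitting the training dataset across pods" means.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1))  # gradients are averaged across pods
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for features, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()        # triggers gradient synchronization
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

The two pieces to notice are DistributedSampler, which splits the data, and DistributedDataParallel, which synchronizes gradients after every backward pass; we return to both later in the chapter.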

In the previous chapter, we saw how Kubeflow Pipelines make it easy to use distributed resources to execute machine learning workflows. In particular, they enable the concurrent execution of independent components, such as training multiple models with different hyperparameters during a grid search. In this chapter and the next, we dive one step deeper and look at how the training of a single model can itself be parallelized, covering data and model parallelism, remote procedure calls, and neural network architecture search. This is an involved topic, but it is well worth the effort: models trained on large datasets often require hours or even days of training time, and any speedup lets us run more experiments, which in turn often leads to significantly better models.

5.1 Deep Learning Basics

5.1.1 Training Neural Networks

5.1.2 Parallelization Overview

5.1.3 Kubeflow components for distributed training

5.2 Distributed Data Paradigm

5.3 Summary