8 Scaling out with distributed training

 

This chapter covers

  • Understanding distributed data parallel gradient descent
  • Using gradient accumulation with gradient descent for out-of-memory data sets
  • Evaluating parameter server versus ring-based approaches for distributed gradient descent
  • Understanding reduce-scatter and all-gather phases of ring-based gradient descent
  • Implementing a single-node version of ring-based gradient descent using Python

In chapter 7, you learned about scaling up your machine learning implementation to make the most of the compute resources available in a single compute node, for example, by taking advantage of the more powerful processors in GPU devices. However, as you will discover once you launch a machine learning system in production, the growth in the number of training examples and in the size of the training data sets can outpace the compute capacity of even the most capable servers and workstations. Although contemporary public cloud infrastructure lets you scale up quite far by upgrading to a more powerful processor or by adding more memory or more GPU devices, you should have a better plan for the long run.

8.1 What if the training data set does not fit in memory?

8.1.1 Illustrating gradient accumulation

8.1.2 Preparing a sample model and data set

8.1.3 Understanding gradient descent using out-of-memory data shards

8.2 Parameter server approach to gradient accumulation

8.3 Introducing logical ring-based gradient descent

8.4 Understanding ring-based distributed gradient descent

8.5 Phase 1: Reduce-scatter

8.6 Phase 2: All-gather

Summary