
8 Scaling out with distributed training


This chapter covers:

  • Basics of distributed data parallel gradient descent
  • Gradient accumulation for training datasets that do not fit in memory
  • Parameter server vs. ring-based approaches for distributed gradient descent
  • The Horovod algorithm, including its reduce-scatter and all-gather phases

In chapter 7 you learned about scaling up your machine learning implementation to make the most of the compute resources available in a single compute node, for example by taking advantage of the more powerful processors in graphics processing unit (GPU) devices. However, as you may discover once a machine learning system is in production, the growth in the number of training examples and in the size of the training datasets can outpace the compute capacity of even the most capable servers and workstations. Although with contemporary public cloud infrastructure you can get far by scaling up, upgrading to a more powerful processor, adding more memory, or attaching more GPU devices, you should have a better plan for the long run.
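
Before working through the sections that follow, it may help to see the core idea of data parallel gradient descent in a self-contained form. The sketch below is not the chapter's implementation; it simulates several workers inside a single Python process with NumPy, shards a toy linear regression dataset across them, and averages the per-worker gradients before every update, which is the role the all-reduce step plays on a real cluster. The worker count, learning rate, and dataset are illustrative assumptions.

import numpy as np

# Toy linear regression problem: y = X @ w_true plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=1_000)

NUM_WORKERS = 4                       # simulated worker nodes
shards_X = np.array_split(X, NUM_WORKERS)
shards_y = np.array_split(y, NUM_WORKERS)

w = np.zeros(4)                       # parameters replicated on every worker
lr = 0.1

for step in range(100):
    # Each worker computes the mean squared error gradient
    # using only its own shard of the training data.
    local_grads = []
    for Xs, ys in zip(shards_X, shards_y):
        err = Xs @ w - ys
        local_grads.append(2 * Xs.T @ err / len(ys))

    # "All-reduce": average the per-worker gradients so every replica
    # applies the identical update and the copies of w stay in sync.
    grad = np.mean(local_grads, axis=0)
    w -= lr * grad

print(w)                              # should be close to w_true

In a production setting the averaging step is performed by a communication layer rather than a Python loop; the parameter server and ring-based approaches covered in this chapter are two different ways of organizing exactly that step.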

8.1 What if the training dataset does not fit in memory?

8.2 Parameter server approach to gradient accumulation

8.3 Intuition behind logical ring-based communication

8.4 Understanding Horovod for distributed gradient descent

8.5 Horovod phase 1: reduce-scatter

8.6 Horovod phase 2: all-gather

8.7 Summary