6 Distributed model training: advanced
This chapter covers
- Using Kubeflow’s PyTorchJobs and PyTorch’s Remote Procedure Call (RPC) framework to train a reinforcement learning agent, across several pods, to balance a beam.
- Searching for optimal neural network architectures using Katib.
In the previous chapter, we saw how Kubeflow’s PyTorchJobs and PyTorch’s distributed training framework can be used to train a model across multiple pods and thereby accelerate training. In many cases, though, we want to implement custom parallel algorithms. For example, we might want to train a random forest in which each tree is trained on a different worker, or train a reinforcement learning agent that learns to drive a car from camera images or to play a video game. We might also want to experiment with asynchronous data parallelism, where each worker updates the weights with its gradients as soon as they are computed, without waiting for the other workers to finish computing theirs. All of these cases require finer-grained control over the workers and over data communication.
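To make the asynchronous case concrete, here is a minimal single-machine sketch of lock-free asynchronous data parallelism, using PyTorch’s shared-memory tensors in the Hogwild style. The two-worker setup, the toy linear model, and the random batches are illustrative assumptions of this sketch, not code from this chapter’s project:

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F

def train(model):
    # Each worker runs its own optimizer over the *shared* weights and
    # applies its gradients immediately, without synchronizing with peers.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    data, target = torch.randn(8, 4), torch.randn(8, 1)  # toy local batch
    for _ in range(10):
        opt.zero_grad()
        F.mse_loss(model(data), target).backward()
        opt.step()  # lock-free update of the shared parameters

if __name__ == "__main__":
    model = nn.Linear(4, 1)   # toy model standing in for a real network
    model.share_memory()      # place the weights in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Because the workers never wait for one another, their updates can interleave arbitrarily; that lack of coordination is exactly the kind of behavior the standard all-reduce-based training loop from the previous chapter cannot express.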
In this chapter, we will explore how Remote Procedure Calls (RPCs) give us this fine-grained control, and use them to implement a custom parallelized training procedure for a reinforcement learning agent that learns to control a dynamic environment. We will also see how Katib’s Neural Architecture Search capabilities can be used to automatically search for an optimal neural network architecture.
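As a first taste of the RPC framework, the following minimal sketch spawns two local processes and has rank 0 synchronously invoke a function on the other worker via torch.distributed.rpc. The worker names, the port, and the compute_square function are illustrative assumptions; the chapter’s actual training procedure will run across pods rather than local processes:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def compute_square(x):
    # Executed on whichever worker receives the RPC.
    return x * x

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"  # rendezvous address (local here)
    os.environ["MASTER_PORT"] = "29500"      # arbitrary free port
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Rank 0 acts as the coordinator: it decides what runs where,
        # here by synchronously calling compute_square on worker1.
        result = rpc.rpc_sync("worker1", compute_square,
                              args=(torch.full((2,), 3.0),))
        print(result)  # tensor([9., 9.])
    rpc.shutdown()  # blocks until all outstanding RPC work is done

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)
```

The key point is that, unlike the collective operations used for standard data parallelism, an RPC lets one process run arbitrary code on another and collect the result, which is the building block we will use for the custom training procedure in this chapter.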