One obvious trend in deep learning research is to improve model performance with larger datasets and bigger models with increasingly complex architectures. But more data and bulkier models have consequences: they slow down both model training and model development. As is often the case in computing, performance is pitted against speed. For example, it can take several months to train a BERT (Bidirectional Encoder Representations from Transformers) natural language processing model on a single GPU.
To address the problem of ever-growing datasets and model parameter counts, researchers have created various distributed training strategies, and major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement them. With the help of these SDKs, data scientists can write training code that runs in parallel across multiple devices (CPUs or GPUs), as sketched below.
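As a minimal sketch of what such an SDK looks like in practice, the snippet below uses TensorFlow's tf.distribute.MirroredStrategy for single-machine, multi-GPU data parallelism; the toy MNIST model, batch size, and other hyperparameters are illustrative assumptions rather than details from the text.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU (falling back to
# CPU if none are found) and aggregates gradients across replicas each step.
strategy = tf.distribute.MirroredStrategy()
print(f"Number of replicas in sync: {strategy.num_replicas_in_sync}")

# The model and its variables must be created inside the strategy scope so
# that each replica receives a mirrored copy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# The fit() call is unchanged; each global batch is sharded across replicas.
# (MNIST is used here purely as a stand-in dataset.)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, epochs=1, batch_size=256)
```

The key point of such SDKs is visible here: only model construction moves inside strategy.scope(), while the rest of the training code stays the same, because the strategy transparently handles replication and gradient aggregation across devices.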