Model training in machine learning is not the exclusive responsibility of researchers and data scientists. Their work on training algorithms is crucial, because they define the model architecture and the training plan. But just as physicists need a software system to control an electron-positron collider in order to test their particle theories, data scientists need an effective software system to manage expensive compute resources, such as GPUs, CPUs, and memory, and to execute the training code. This system for managing compute resources and executing training code is known as the model training service.
Building a high-quality model depends not only on the training algorithm but also on the compute resources and the system that executes the training. A good training service can make model training much faster and more reliable, and it can also reduce the average cost of building a model. When the dataset or model architecture is too large to train on a single machine, a training service that manages the distributed computation becomes a practical necessity.
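To make the two responsibilities described above concrete (managing compute resources and executing training code), here is a minimal, purely illustrative sketch in Python. Every name in it (`TrainingJob`, `TrainingService`, `submit`, `run_pending`) is hypothetical and not taken from any real framework, and it runs jobs one at a time in a single process, whereas a real training service would schedule them concurrently across a cluster.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class TrainingJob:
    """A hypothetical unit of work submitted by a data scientist."""
    name: str
    gpus_needed: int
    train_fn: Callable[[], str]  # stands in for the user's training code


@dataclass
class TrainingService:
    """Toy training service: tracks a GPU pool and executes queued jobs."""
    total_gpus: int
    free_gpus: int = field(init=False)
    pending: Deque[TrainingJob] = field(default_factory=deque)
    finished: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # At startup the whole pool is available.
        self.free_gpus = self.total_gpus

    def submit(self, job: TrainingJob) -> None:
        # Reject jobs that could never fit in the pool.
        if job.gpus_needed > self.total_gpus:
            raise ValueError(f"{job.name} needs more GPUs than the pool has")
        self.pending.append(job)

    def run_pending(self) -> None:
        # Assumption: jobs run sequentially, in submission order.
        while self.pending:
            job = self.pending.popleft()
            self.free_gpus -= job.gpus_needed      # reserve resources
            try:
                result = job.train_fn()            # execute the training code
            finally:
                self.free_gpus += job.gpus_needed  # always release resources
            self.finished.append(f"{job.name}: {result}")


if __name__ == "__main__":
    service = TrainingService(total_gpus=4)
    service.submit(TrainingJob("intent-classifier", 2, lambda: "done"))
    service.run_pending()
    print(service.finished)
```

The point of the sketch is the separation of concerns: the data scientist supplies only `train_fn`, while the service decides when the job runs and guarantees that the resources it reserved are released, even if the training code fails.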