This chapter covers
- Using model serving to generate predictions or make inferences on new data with previously trained machine learning models
- Handling model serving requests and achieving horizontal scaling with replicated model serving services
- Processing large model serving requests using the sharded services pattern
- Assessing model serving systems and event-driven design
In the previous chapter, we explored some of the challenges involved in distributed training and introduced a couple of practical patterns that can be incorporated into this component. Distributed training is the most critical part of a distributed machine learning system. For example, we saw the challenge of training very large machine learning models, such as one that tags the main themes in new YouTube videos, when the model cannot fit on a single machine, and we looked at how the parameter server pattern overcomes this difficulty by partitioning the model's parameters across multiple machines. We also learned how the collective communication pattern speeds up distributed training for smaller models by letting workers exchange gradients directly, avoiding unnecessary communication overhead between parameter servers and workers. Last but not least, we talked about some of the vulnerabilities often seen in distributed machine learning systems, caused by corrupted datasets, unstable networks, and preempted worker machines, and how we can address those problems.
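As a quick refresher on the collective communication pattern, the sketch below simulates the core averaging step: every worker contributes its local gradient, and all workers end up with the same combined gradient, with no parameter server in the middle. This is a minimal illustration, not the book's code; `allreduce_mean` is a hypothetical helper that stands in for what a real allreduce operation (for example, in an MPI- or NCCL-based framework) would compute.

```python
# Minimal sketch (hypothetical helper, not the book's code) of the
# averaging step performed by an allreduce in the collective
# communication pattern: gradients from all workers are combined
# elementwise, and every worker receives the same result.

def allreduce_mean(worker_gradients):
    """Average per-worker gradient vectors elementwise."""
    num_workers = len(worker_gradients)
    summed = [sum(vals) for vals in zip(*worker_gradients)]
    return [s / num_workers for s in summed]

# Three workers each computed a local gradient for a two-parameter model.
grads = [
    [0.25, -0.5],
    [0.5, 0.0],
    [0.75, 0.5],
]
print(allreduce_mean(grads))  # -> [0.5, 0.0]
```

In a real system, each worker would then apply this averaged gradient to its own copy of the model, keeping all replicas synchronized without routing updates through dedicated parameter servers.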