4 Model serving patterns

This chapter covers

  • Using model serving to generate predictions or make inferences on new data with previously trained machine learning models
  • Handling model serving requests and achieving horizontal scaling with replicated model serving services
  • Processing large model serving requests using the sharded services pattern
  • Assessing model serving systems and event-driven design

In the previous chapter, we explored some of the challenges involved in the distributed training component and introduced a couple of practical patterns that can be incorporated into it. Distributed training is the most critical part of a distributed machine learning system. For example, we saw the challenges of training a very large machine learning model, such as one that tags the main themes in new YouTube videos, when the model cannot fit on a single machine, and we looked at how to overcome that difficulty by using the parameter server pattern. We also learned how to use the collective communication pattern to speed up distributed training for smaller models and avoid unnecessary communication overhead between parameter servers and workers. Last but not least, we talked about some of the vulnerabilities often seen in distributed machine learning systems due to corrupted datasets, unstable networks, and preempted worker machines, and how we can address those problems.
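As a preview of the replicated services pattern covered in section 4.2, here is a minimal, conceptual sketch of how identical model replicas behind a round-robin load balancer can share serving traffic. All names here (`make_model_server`, `RoundRobinBalancer`) are hypothetical illustrations, not APIs from any library, and the "inference" step is a stand-in for running a real trained model.

```python
import itertools

def make_model_server(name):
    """A stand-in for one replica of a trained model behind a serving endpoint.

    In a real system this would load the model weights and run inference;
    here it just tags each prediction with the replica that handled it.
    """
    def serve(request):
        # Pretend inference: produce a placeholder prediction for the payload.
        return {"replica": name, "prediction": f"themes({request})"}
    return serve

class RoundRobinBalancer:
    """Distributes serving requests across identical (replicated) servers.

    Because every replica holds the same trained model, any replica can
    answer any request, which is what makes horizontal scaling possible.
    """
    def __init__(self, servers):
        self._servers = itertools.cycle(servers)

    def handle(self, request):
        return next(self._servers)(request)

balancer = RoundRobinBalancer(
    [make_model_server(f"replica-{i}") for i in range(3)]
)
responses = [balancer.handle(f"video-{i}") for i in range(6)]
print([r["replica"] for r in responses])
# each replica handles two of the six requests in turn
```

The key property this sketch illustrates is that the replicas are stateless and interchangeable, so adding more of them increases serving capacity without changing the client-facing interface.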

4.1 What is model serving?

4.2 Replicated services pattern: Handling the growing number of serving requests

4.2.1 The problem

4.2.2 The solution

4.2.3 Discussion

4.2.4 Exercises

4.3 Sharded services pattern

4.3.1 The problem: Processing large model serving requests with high-resolution videos

4.3.2 The solution

4.3.3 Discussion

4.3.4 Exercises

4.4 The event-driven processing pattern

4.4.1 The problem: Responding to model serving requests based on events