15 Serving and inference optimization


This chapter covers

  • Challenges that may arise during the serving and inference stage
  • Tools and frameworks that will come in handy
  • Optimizing inference pipelines

Getting your machine learning (ML) model to run in a production environment is one of the final steps toward an efficient operating lifecycle for your system. Some ML practitioners show little interest in this part of the craft, preferring to focus on developing and training their models. That is a misstep, however: a model is only useful once it is deployed and effectively used in production. In this chapter, we discuss the challenges of deploying and serving ML models and review different methods for optimizing the inference process.

15.1 Serving and inference: Challenges

15.2 Tradeoffs and patterns

15.2.1 Tradeoffs

15.2.2 Patterns

15.3 Tools and frameworks

15.3.1 Choosing a framework

15.3.2 Serverless inference

15.4 Optimizing inference pipelines

15.4.1 Starting with profiling

15.4.2 The best optimization is minimum optimization

15.5 Design document: Serving and inference

15.5.1 Serving and inference for Supermegaretail

15.5.2 Serving and inference for PhotoStock Inc.

Summary