
8 Model serving design


This chapter covers

  • Clarifying the model serving terminology and challenges
  • Common model serving approaches
  • Designing model serving systems for different user scenarios

The simple definition of model serving is the process of executing a model on user input data. Among all the activities in a deep learning system, model serving is the closest to the end customers. All the hard work of dataset preparation, training algorithm development, and hyperparameter tuning results in models, and model serving services are what present those models to customers.

Take speech translation as an example. After training a sequence-to-sequence model for voice translation, the model is usually hosted in a web service and exposed through a web API so that people can use it remotely. We (the customers) can then send a voice audio file over the web API and get back the result: a translated voice audio file. All the model loading and execution happens in the web service backend. Everything included in this user workflow (the service, the model files, and the model execution) is called model serving.

8.1 Explaining model serving

8.1.1 What is a deep learning model?

8.1.2 Model prediction and inference

8.1.3 What is model serving?

8.1.4 Model serving terminology

8.1.5 Model serving challenges

8.2 Common model serving strategies

8.2.1 Direct model embedding

8.2.2 Model service

8.2.3 Model server

8.3 Designing a prediction service

8.3.1 Single model application

8.3.2 Multi-tenant application

8.3.3 Supporting multiple applications in one system

8.3.4 Common prediction service requirements

8.4 Summary