17 Deploying to production

 

This chapter covers

  • Options for deploying PyTorch models
  • Deploying models with web frameworks and APIs
  • Optimizing inference performance
  • Exporting models for various deployment targets
  • Running exported and natively implemented models from C++

In part 1 of this book, we learned a lot about models, and part 2 left us with a detailed path for creating good models for a particular problem. Now that we have these great models, we need to take them where they can be useful. Maintaining the infrastructure for running deep learning inference at scale matters both architecturally and in terms of cost. While PyTorch started as a research-focused framework, it has evolved considerably, adding production-oriented features that make it an end-to-end platform for both research and large-scale production.

What deploying to production means will vary with the use case: the sections of this chapter walk through the main options, from serving a model behind a web API to exporting it for C++ and mobile runtimes.
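To make the serving route concrete before we dive in, here is a minimal sketch of a PyTorch model behind an HTTP endpoint. This is not the chapter's own example: it assumes a pretrained torchvision ResNet-18 standing in for our model, and that fastapi, uvicorn, and pillow are installed.

import io

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import models, transforms

app = FastAPI()

# Load the weights once at startup and switch to inference mode.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded image and run a single forward pass.
    img = Image.open(io.BytesIO(await file.read())).convert("RGB")
    batch = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)
    return {"class_index": int(logits.argmax(dim=1))}

Started with uvicorn (for example, uvicorn serve:app, where serve.py is a hypothetical module name), this handles one request at a time with no batching or optimization; the rest of the chapter refines exactly these points, with request batching, torch.compile, exporting, and running models outside Python.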

17.1 Serving PyTorch models

17.1.1 Our model served by Gradio

17.1.2 Our model behind a FastAPI server

17.1.3 What we want from deployment

17.1.4 Request batching and streaming responses

17.1.5 How to make PyTorch models even faster

17.2 Exporting models

17.2.1 Interoperability beyond PyTorch with ONNX

17.2.2 PyTorch’s own export: torch.export

17.3 Expanding on torch.compile

17.3.1 Full graph capture vs. disjoint graphs

17.4 Understanding execution with torch.profiler

17.5 Using PyTorch outside of Python

17.5.1 LibTorch: PyTorch in C++

17.6 Going mobile: ExecuTorch

17.7 Conclusion

17.8 Exercises

17.9 Summary