front matter

preface

In recent years, machine learning has made tremendous progress, yet large-scale machine learning remains challenging. Take model training as an example. Given the variety of machine learning frameworks, such as TensorFlow, PyTorch, and XGBoost, it’s not easy to automate the process of training models on distributed Kubernetes clusters. Different models require different distributed training strategies, such as parameter servers or collective communication that exploits the network topology. In a real-world machine learning system, many other essential components, such as data ingestion, model serving, and workflow orchestration, must be designed carefully to make the system scalable, efficient, and portable. Machine learning researchers with little or no DevOps experience cannot easily launch and manage distributed training tasks.

Many books have been written on either machine learning or distributed systems, but none so far covers the combination of the two and bridges the gap between them. This book introduces patterns and best practices for building large-scale machine learning systems in distributed environments.

acknowledgments

about this book

Who should read this book?

How this book is organized: A roadmap

About the code

liveBook discussion forum

about the author

about the cover illustration