This chapter covers
- Understanding machine learning operations (MLOps) and its role in production ML
- Key challenges in building reliable ML systems
- How MLOps differs from traditional DevOps
- Building confidence through structured ML processes
In chapter 1, we introduced the ML life cycle and the foundational skills needed to become an effective ML engineer. Now, let’s dig deeper into the machine learning operations (MLOps) practices and principles that will help you reliably deliver value through ML systems. ML and ML models are often not the end product of an organization, but rather a means to an end.
The gap between business value generation, requirements, and the necessary infrastructure is the primary reason ML, and by extension MLOps, is hard. Very few companies truly do research on model development; most reuse existing architectures and train or adapt off-the-shelf models for specific domains and problem sets. The availability of comprehensive open source libraries, such as those from Hugging Face, can also make the modeling step itself seem almost trivial. After defining a problem and identifying an architecture to solve it, the hard questions come into focus:
- How will the model be trained?
- How will data get to the model?
- How will the model interact with the other services?
- Where will the model be run?
- How do we make sure the model stays accurate over time?