14 Monitoring and reliability


This chapter covers

  • Monitoring as part of machine learning system design
  • Software system health
  • Data quality and integrity
  • Model quality and relevance

Traditional software development rests on a simple principle: a well-built product performs with high stability, efficiency, and predictability, and those properties do not change over time. In contrast, the world of machine learning (ML) is more complex, and work on an ML system does not end at its release. There is a practical explanation for this: a traditional solution executes strictly within predesigned algorithms, while an ML system's functionality rests on a probabilistic model trained on a limited sample of input data.

This means that the model will inevitably degrade over time and occasionally behave unexpectedly, because the data it was trained on differs from the data it receives in real-life conditions.

These risks cannot be eliminated, but you need to be prepared for them and able to mitigate them so that your system remains effective and valuable to your business over the long term.
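The gap between training data and live data can be quantified. One common metric is the Population Stability Index (PSI), which bins a feature's values and compares the training-time distribution with the distribution observed in production. Below is a minimal, stdlib-only sketch under assumed conventions: the `psi` function, its thresholds, and the synthetic data are illustrative, not a definitive implementation.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    A common (conventional, not absolute) reading of the result:
    < 0.1 no significant shift, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against all-equal samples

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(5000)]         # training-time feature
live_ok = [random.gauss(0.0, 1.0) for _ in range(5000)]       # serving data, same distribution
live_shifted = [random.gauss(1.0, 1.0) for _ in range(5000)]  # serving data, mean shifted by 1 std

print(f"matching distributions: PSI = {psi(train, live_ok):.3f}")       # small value
print(f"shifted distribution:   PSI = {psi(train, live_shifted):.3f}")  # large value
```

In practice such a check runs periodically against a reference window and raises an alert when the metric crosses a chosen threshold; later sections of this chapter discuss how to monitor and react in more detail.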

In this chapter, we'll cover the essence of monitoring as part of ML system design and the sources of the problems your ML model may encounter as it operates. We will also explore how to monitor for typical changes in system behavior and how to respond to them.

14.1 Why monitoring is important

14.1.1 Incoming data

14.1.2 Model

14.1.3 Model output

14.1.4 Postprocessing/decision-making

14.2 Software system health

14.3 Data quality and integrity

14.3.1 Processing problems

14.3.2 Data source corruption

14.3.3 Cascade/upstream models

14.3.4 Schema change

14.3.5 Training-serving skew

14.3.6 How to monitor and react

14.4 Model quality and relevance

14.4.1 Data drift

14.4.2 Concept drift

14.4.3 How to monitor

14.4.4 How to react

14.5 Design document: Monitoring