chapter eleven

11 Monitoring and explainability

 

This chapter covers

  • Setting up monitoring and logging for ML applications
  • Routing alerts using Alertmanager
  • Storing logs in Loki for scalable log aggregation and querying
  • Identifying data drift
  • Using model explainability to understand how the ML model makes its decisions

Moving models to production is only the first step; keeping them performing reliably over time requires robust monitoring and an understanding of their behavior. In this final chapter, we'll explore how to implement comprehensive monitoring for ML systems and gain insight into their decision-making processes (Figure 11.1).

Figure 11.1 The mental map, where we are now focusing on model monitoring (8)

We'll tackle monitoring from two critical angles. First, we'll set up basic operational monitoring to ensure our services meet performance and reliability requirements. Then, we'll implement ML-specific monitoring to detect data drift and track model behavior.

Model monitoring therefore splits into two main components:

  • Basic monitoring
  • Data drift monitoring
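To make the second component more concrete before we dive in, here is a minimal sketch of one common way to quantify drift in a single numeric feature: the population stability index (PSI). This is an illustrative assumption rather than the approach the rest of the chapter relies on; the function name `population_stability_index`, its parameters, and the sample data are hypothetical.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the
    # reference feature is constant so we never divide by zero.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Hypothetical data: a feature as seen at training time and in production.
training = [i / 100 for i in range(1000)]   # roughly uniform over [0, 10)
unchanged = list(training)                  # production looks like training
shifted = [v + 3.0 for v in training]       # production values moved upward

print("PSI, unchanged:", population_stability_index(training, unchanged))
print("PSI, shifted:  ", population_stability_index(training, shifted))
```

A commonly cited rule of thumb treats a PSI below about 0.1 as stable and above about 0.25 as a significant shift, but those cutoffs are conventions, not guarantees, and should be tuned per feature.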

11.1 Monitoring

11.1.1 Basic monitoring

11.1.2 Custom metrics

11.1.3 Logging

11.1.4 Alerting

11.2 Data drift detection

11.2.1 Object detection

11.2.2 Movie recommender

11.3 Explainability

11.3.1 Object detection

11.3.2 Movie recommendation

11.4 Looking back, moving forward!

11.5 Summary