11 Monitoring and Explainability

 

This chapter covers

  • Setting up monitoring and logging for ML applications
  • Routing alerts using Alertmanager
  • Storing logs in Loki for scalable log aggregation and querying
  • Identifying data drift
  • Using model explainability to understand how the ML model makes its decisions

With the models now available as services, our next goal is to monitor them. Monitoring is important to ensure that the models work as expected and meet the criteria agreed upon by the business and data science teams. Model monitoring can be split into two main components:

  • Basic monitoring
  • Data drift monitoring

Basic monitoring refers to ensuring the operational efficiency of the deployed service. Our model services will eventually integrate with other organizational services and must meet any required SLAs (service level agreements). Common SLA metrics include uptime, throughput, response latency, and response quality. Most services deployed in a production environment have a fixed error budget, the amount of unreliability a service is allowed to accumulate before it breaches its SLA (for example, a 99.9% uptime target leaves roughly 43 minutes of allowed downtime in a 30-day month). Maintaining service stability and reacting quickly to any unforeseen issues is therefore extremely important.
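
To make these SLA metrics concrete, the following sketch shows one way a model service could expose request counts, error counts, and latency for scraping, assuming a Prometheus-style monitoring stack (Alertmanager, which appears later in this chapter, is part of the Prometheus ecosystem). The metric names, the port, and the predict_stub function are illustrative placeholders rather than the service code used in this book.

# Minimal instrumentation sketch using the prometheus_client library.
# Metric names, the port, and predict_stub are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "model_requests_total", "Total prediction requests", ["status"]
)
LATENCY = Histogram(
    "model_request_latency_seconds", "Prediction request latency in seconds"
)

def predict_stub(payload):
    """Stand-in for a real model call; sleeps to simulate inference time."""
    time.sleep(random.uniform(0.01, 0.05))
    return {"prediction": 1}

def handle_request(payload):
    start = time.perf_counter()
    try:
        result = predict_stub(payload)
        REQUESTS.labels(status="success").inc()
        return result
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    # Expose a /metrics endpoint on port 8000 for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_request({"feature": 0.5})

From these raw counters, throughput and error rate can be derived at query time (for example, with PromQL's rate() function), which ties the per-request measurements back to the uptime and error-budget targets described above.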

11.1 Monitoring

11.1.1 Basic Monitoring

11.1.2 Custom Metrics

11.1.3 Logging

11.1.4 Alerting

11.2 Data Drift Detection

11.2.1 Object Detection

11.2.2 Movie Recommender

11.3 Explainability

11.3.1 Object Detection

11.3.2 Movie Recommendation

11.4 Looking Back, Moving Forward!

11.5 Summary