chapter eleven

11 Monitoring and Explainability

This chapter covers

Setting up monitoring and logging for ML applications
Routing alerts using Alertmanager
Storing logs in Loki for scalable log aggregation and querying
Identifying data drift
Using model explainability to understand how the ML model makes its decisions

With the models now available as a service, our next goal is to monitor them. Monitoring is important to ensure that the models work as expected and meet the criteria agreed upon by the business and data science teams. Model monitoring can be split into two main components:

Basic monitoring
Data drift monitoring

Basic monitoring refers to ensuring the operational efficiency of the deployed service. Our model services will eventually integrate with other organizational services and must meet any required SLAs (service-level agreements). Common SLA metrics include uptime, throughput, response latency, and response quality. Most services deployed in a production environment have a fixed error budget (an acceptable level of unreliability in a service); therefore, maintaining service stability and reacting quickly to resolve unforeseen issues is extremely important.
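To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic, assuming the SLA is expressed as an availability target over a fixed window. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values taken from this chapter.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability the target tolerates in the window.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def budget_remaining(availability_target: float,
                     downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget.

    A negative result means the service has exceeded its budget.
    """
    budget = error_budget_minutes(availability_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% availability target over a 30-day window.
    budget = error_budget_minutes(0.999)
    print(f"Error budget: {budget:.1f} minutes")
    # Assumed example: 10 minutes of downtime already recorded this window.
    print(f"Budget remaining: {budget_remaining(0.999, 10.0):.0%}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of allowed downtime, which is why a quick reaction to incidents matters: a single slow recovery can consume most of the budget for the window.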

11.1 Monitoring

11.1.1 Basic Monitoring

11.1.2 Custom Metrics

11.1.3 Logging

11.1.4 Alerting

11.2 Data Drift Detection

11.2.1 Object Detection

11.2.2 Movie Recommender

11.3 Explainability

11.3.1 Object Detection

11.3.2 Movie Recommendation

11.4 Looking Back, Moving Forward!

11.5 Summary