11 Monitoring and Explainability
This chapter covers
- Setting up monitoring and logging for ML applications
- Routing alerts using Alertmanager
- Storing logs in Loki for scalable log aggregation and querying
- Identifying data drift
- Using model explainability to understand how the ML model makes its decisions
With the models now available as a service, our next goal is to monitor them. Monitoring is important to ensure the models are working as expected and meeting the criteria agreed upon by the business and data science teams. Model monitoring can be split into two main components:
- Basic monitoring
- Data drift monitoring
Basic monitoring refers to ensuring the operational efficiency of the deployed service. Our model services will eventually integrate with other organizational services and must meet any required SLAs (service level agreements). Common SLA metrics include uptime, throughput, response latency, and response quality. Most services deployed in a production environment have a fixed error budget (an acceptable level of unreliability in a service), so maintaining service stability and reacting quickly to resolve unforeseen issues is extremely important.
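To make the idea of an error budget concrete, the following minimal sketch converts an availability target into allowed downtime and tracks how much of a request-based budget is left. The 99.9% target, the 30-day window, the request counts, and the function names are illustrative assumptions, not values or APIs from this chapter.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for a target over a time window."""
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def budget_remaining(availability_target: float,
                     total_requests: int,
                     failed_requests: int) -> float:
    """Return the unspent fraction of a request-based error budget.

    A negative value means the budget has been overspent.
    """
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - availability_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers: a 99.9% target over 30 days, and one million
# requests of which 250 failed.
downtime = error_budget_minutes(0.999, window_days=30)
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(f"Allowed downtime: {downtime:.1f} minutes")
print(f"Budget remaining: {remaining:.0%}")
```

Under these assumed numbers, a 99.9% target leaves roughly 43.2 minutes of downtime per 30 days, and 250 failures out of a million requests consume about a quarter of the budget. In practice the inputs would come from the metrics collected by the monitoring stack rather than from hard-coded values.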