3 Using offline evaluations as diagnostics

This chapter covers

  • Defining diagnostic offline evaluations in more detail
  • Illustrating real-world scenarios where diagnostics are useful
  • Handling the common pitfalls of diagnostic offline evaluations
  • Incorporating diagnostic offline evaluations into your engineering workflow

In Chapter 2, we covered the basic principles of offline evaluation, including the anatomy of an offline evaluation: data, evaluation design, and metrics. We also introduced the different types of evaluations so you know what’s possible. In this chapter, we’ll explore diagnostic evaluations and unpack their value and applications in the context of machine learning-powered products.

The term diagnostic might sound clinical, even unnerving, if it brings the medical world to mind, but in the context of evaluating machine learning models it’s about uncovering hidden insights and understanding why a model behaves the way it does. In fact, it’s closely related to the broader notion of AI explainability: the idea that we should be able to trace, interpret, and understand model outputs rather than treating them as black boxes.
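
To make this concrete before we dive in, here’s a minimal sketch of what a diagnostic evaluation can look like in code. The dataset, column names, and user segments below are hypothetical; the point is simply that instead of reporting one aggregate number, a diagnostic evaluation slices that number into views that explain where the model struggles.

# A minimal sketch of a diagnostic "error slicing" evaluation.
# The data, column names, and segments here are hypothetical.
import pandas as pd

# Offline predictions joined with ground-truth labels
df = pd.DataFrame({
    "user_segment": ["new", "new", "returning", "returning", "power", "power"],
    "actual":       [1, 0, 1, 1, 0, 1],
    "predicted":    [0, 0, 1, 1, 0, 0],
})
df["correct"] = df["actual"] == df["predicted"]

# A single aggregate metric hides where the errors come from...
print(f"Overall accuracy: {df['correct'].mean():.2f}")

# ...so break it down by segment to see who the model fails.
print(df.groupby("user_segment")["correct"].mean())

A large gap between segments, say, new users scoring far below power users, is exactly the kind of diagnostic signal this chapter teaches you to look for, even when the overall number looks healthy.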

3.1 Diagnosing model behavior

3.1.1 Evaluating bias in movie recommendations

3.1.2 How diagnostics complement performance offline evaluations

3.1.3 Bridging the gap between model-level and product-level understanding

3.2 Show me the metrics

3.2.1 Error analysis metrics

3.2.2 Fairness and bias evaluation metrics

3.2.3 Robustness testing metrics

3.3 Connecting diagnostics to product impact

3.3.1 How product leads can use diagnostic metrics for decision-making

3.3.2 Balancing the product experience

3.4 Practical applications of diagnostic evaluations

3.4.1 Testing model robustness against noise

3.4.2 Diagnosing data distribution shifts

3.4.3 Improving cold-start cases

3.4.4 Addressing long-tail item underrepresentation

3.4.5 Detecting user segment disparities

3.4.6 Making you think about the product experience

3.5 Common challenges

3.6 What makes a good diagnostic offline evaluation?

3.6.1 Have a goal in mind

3.6.2 Have the right data

3.6.3 Make sure you can take action

3.7 Engineering considerations

3.7.1 Fine-grained, detailed logging

3.7.2 Tooling to improve interpretability

3.7.3 Sensitive data

3.8 Summary