
7 How do you measure classification models? Accuracy and its friends


This chapter covers

  • Types of errors a model can make: False positives and false negatives.
  • Putting these errors in a table: The confusion matrix.
  • What are accuracy, recall, precision, F-score, sensitivity, and specificity, and how are they used to evaluate models?
  • What is the ROC curve, and how does it keep track of sensitivity and specificity at the same time?
  • What is the area under the curve (AUC), and how does it evaluate our classification models?
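All of the metrics previewed above are computed from the four counts in the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). As a quick reference (the chapter builds up each of these step by step), the standard definitions are:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\[4pt]
\text{Precision} &= \frac{TP}{TP + FP} \\[4pt]
\text{Recall (Sensitivity)} &= \frac{TP}{TP + FN} \\[4pt]
\text{Specificity} &= \frac{TN}{TN + FP} \\[4pt]
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```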

This chapter is slightly different from the previous two, because it doesn’t focus on building classification models; instead, it focuses on evaluating them. For a machine learning professional, being able to evaluate the performance of different models is as important as being able to train them. There are many reasons for this. One is that we seldom train a single model on a dataset; we train several different models and select the one that performs best. Another reason is that we need to make sure a model is of good quality before putting it into production. The quality of a model is not always trivial to measure, and in this chapter I teach you several techniques to evaluate classification models. In chapter 4 you learned how to evaluate regression models, so you can think of this chapter as its analog for classification models.
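To make this concrete, here is a minimal sketch of comparing the metrics covered in this chapter on a trained classifier. It assumes Python with scikit-learn, and the synthetic dataset and logistic regression model are purely illustrative, not the chapter's own examples:

```python
# A minimal sketch (assuming scikit-learn is available) of computing the
# evaluation metrics covered in this chapter for a trained classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Illustrative synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any classifier with predict/predict_proba would work here
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # scores needed for the ROC/AUC

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))
print("AUC:      ", roc_auc_score(y_test, y_score))
```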

7.1      Accuracy - How often is my model correct?

7.1.1   Two examples of models - Coronavirus and spam email

7.1.2   A super effective yet super useless model

7.2      How to fix the accuracy problem? Defining different types of errors and how to measure them

7.2.1   False positives and false negatives - which one is worse?

7.2.2   Storing the correctly and incorrectly classified points in a table - the confusion matrix

7.2.3   Recall - Among the positive examples, how many did we correctly classify?

7.2.4   Precision - Among the examples we classified as positive, how many did we correctly classify?

7.2.5   Combining recall and precision as a way to optimize both - The F-score

7.2.6   Recall, precision, or F-scores - Which one should I use?

7.3      A very useful tool to evaluate our model - The receiver operating characteristic (ROC) curve

7.3.1   Sensitivity and specificity - two new ways to evaluate our model (actually only one of them is new)

7.3.2   The receiver operating characteristic (ROC) curve - a way to optimize sensitivity and specificity in a model

7.3.3   A metric that tells us how good our model is - The AUC (area under the curve)

7.3.4   How to make decisions using the ROC curve

7.4      Summary