This chapter covers:
- Defining precision, recall, true/false positives/negatives, how they relate to one another, and what they mean in terms of our model’s performance.
- A new quality metric, the F1 score, and its strengths compared to other possible quality metrics (see the sketch after this list).
- Updating our `logMetrics` function to compute and store precision, recall, and F1 score.
- Balancing our `LunaDataset` to address the training issues uncovered at the end of chapter 8.
- Using TensorBoard to graph our quality metrics as each epoch of training occurs, and verifying that our work to balance the data results in an improved F1 score.
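As a preview of the metrics we'll formalize in this chapter, here is a minimal sketch of how precision, recall, and the F1 score can be computed from counts of true positives, false positives, and false negatives. The function and variable names below are purely illustrative; they are not the ones used in the book's `logMetrics` code.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    # Precision: of everything we flagged as positive, how much really was?
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: of everything that really was positive, how much did we flag?
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 10 true positives, 5 false positives, 20 false negatives
print(precision_recall_f1(10, 5, 20))  # (0.667, 0.333, 0.444), roughly
```

Because F1 is the harmonic mean of precision and recall, it stays low unless both are reasonably high, which is what makes it harder to game than raw accuracy.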
The close of the last chapter left us in a predicament. While we were able to get the mechanics of our deep learning project in place, none of the results were actually useful; the network simply classified everything as benign! To make matters worse, the results seemed great on the surface, since we were looking at the percentage of training and validation samples that were classified correctly. With our data heavily skewed toward benign samples, blindly calling everything benign is a quick and easy way for our model to score well.
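To see why accuracy alone paints such a rosy picture, consider a deliberately simplified example. The 99-to-1 split below is illustrative only, not the actual ratio in our LUNA data.

```python
# Why raw accuracy is misleading on skewed data: a model that labels
# *everything* benign still gets almost everything "right".
num_benign, num_malignant = 990, 10

correct = num_benign                # every benign sample counts as correct
total = num_benign + num_malignant

accuracy = correct / total          # 0.99 -- looks great on paper
recall = 0 / num_malignant          # 0.0  -- not a single malignant sample found

print(f"accuracy: {accuracy:.0%}, recall: {recall:.0%}")
```

Metrics like recall and F1 expose this failure mode immediately, which is exactly why we'll add them to our training loop in this chapter.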