
11 Monitoring Metrics: Precision, Recall, and Pretty Pictures

 

This chapter covers:

  • Defining precision, recall, and true/false positives/negatives; how they relate to one another; and what they tell us about our model’s performance.
  • A new quality metric, the F1 score, and its strengths compared to other possible quality metrics (a minimal sketch of all three metrics follows this list).
  • Updating our logMetrics function to compute and store precision, recall, and F1 score.
  • Balancing our LunaDataset to address the training issues uncovered at the end of the previous chapter.
  • Using TensorBoard to graph our quality metrics as each epoch of training occurs, and verifying that our work to balance the data results in an improved F1 score.
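To make the relationships concrete before we dig into the details, here is a minimal sketch of how the three metrics can be computed from raw counts. This is not the logMetrics code we will write later in the chapter; the function and argument names below are illustrative placeholders.

def computeMetricsSketch(truePos_count, falsePos_count, falseNeg_count):
    # Precision: of the samples we flagged as positive, how many really were?
    precision = truePos_count / ((truePos_count + falsePos_count) or 1)
    # Recall: of the samples that really were positive, how many did we flag?
    recall = truePos_count / ((truePos_count + falseNeg_count) or 1)
    # F1 score: the harmonic mean of precision and recall; it is high only
    # when both precision and recall are high.
    f1_score = 2 * precision * recall / ((precision + recall) or 1)
    return precision, recall, f1_score

The harmonic mean punishes lopsided pairs: a model with perfect recall but near-zero precision still gets an F1 score near zero, which is exactly the behavior we want from a single summary number.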
Note

MEAP readers, please be aware that some of the code on GitHub for this chapter has been updated and now performs better than it did when this chapter was first written. In particular, some of the TensorBoard graphs will look different. This will be corrected in future updates.

The close of the last chapter left us in a predicament. While we were able to get the mechanics of our deep learning project in place, none of the results were actually useful; the network simply classified everything as benign! To make matters worse, the results seemed great on the surface, since we were looking at the percent of the training and testing sets that were classified correctly. With our data heavily skewed towards benign samples, blindly calling everything benign is a quick and easy way for our model to score well.
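To see how lopsided the accounting can get, consider a toy example (the 99-to-1 split below is hypothetical, not the actual LunaDataset ratio) in which a classifier labels every sample benign.

labels      = [0] * 99 + [1]     # 99 benign samples (0) and 1 malignant sample (1)
predictions = [0] * 100          # a "model" that calls everything benign

correct = sum(p == l for p, l in zip(predictions, labels))
print("Accuracy:", correct / len(labels))      # 0.99 -- looks great on the surface

truePos = sum(p == 1 and l == 1 for p, l in zip(predictions, labels))
print("Recall:  ", truePos / sum(labels))      # 0.0 -- misses every malignant sample

Ninety-nine percent accuracy, zero percent recall: this is exactly the gap that precision, recall, and the F1 score are designed to expose.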

11.1  Good dogs versus bad guys: false positives and false negatives

11.2  Graphing the positives and negatives

11.2.1  Recall

11.2.2  Precision

11.2.3  Implementing precision and recall in logMetrics

11.2.4  Our ultimate performance metric: the F1 score

11.2.5  How does our model perform with our new metrics?

11.3  What does an ideal data set look like?

11.3.1  Making the data look less like the actual, and more like the ideal?

11.3.2  Changes to training.py and dset.py to balance benign and malignant samples

11.3.3  Contrasting training with a balanced LunaDataset to previous runs

11.4  Graphing training metrics with TensorBoard

11.4.1  Running TensorBoard

11.4.2  Implementing metrics and TensorBoard in code

11.5  Revisiting the problem of over-fitting