7 Object detection with R-CNN, SSD, and YOLO


This chapter covers

  • Understanding image classification vs. object detection
  • Understanding the general framework of object detection projects
  • Using object detection algorithms like R-CNN, SSD, and YOLO

In the previous chapters, we explained how we can use deep neural networks for image classification tasks. In image classification, we assume that there is only one main target object in the image, and the model’s sole focus is to identify the target category. However, in many situations, we are interested in multiple targets in the image. We want to not only classify them, but also obtain their specific positions in the image. In computer vision, we refer to such tasks as object detection. Figure 7.1 explains the difference between image classification and object detection tasks.

Figure 7.1 Image classification vs. object detection tasks. In classification tasks, the classifier outputs the class probability (cat), whereas in object detection tasks, the detector outputs the bounding box coordinates that localize the detected objects (four boxes in this example) and their predicted classes (two cats, one duck, and one dog).
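To make the contrast in figure 7.1 concrete, here is a minimal sketch of what the two kinds of output might look like as plain Python data. The `Detection` class, the `(x_min, y_min, x_max, y_max)` pixel box convention, and all scores and coordinates are assumptions made up for illustration; they do not come from any particular library or from the figure itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

# Image classification: one set of class probabilities for the whole image.
# These probabilities are invented for illustration.
classification_output = {"cat": 0.92, "dog": 0.05, "duck": 0.03}
predicted_class = max(classification_output, key=classification_output.get)


@dataclass
class Detection:
    """One detected object (hypothetical structure, not a library type)."""
    label: str                       # predicted class of this object
    score: float                     # confidence of the prediction
    box: Tuple[int, int, int, int]   # assumed (x_min, y_min, x_max, y_max) in pixels


# Object detection: one entry per object, each with its own box and class.
# Four boxes, matching the counts in the caption: two cats, one duck, one dog.
detection_output = [
    Detection("cat", 0.97, (30, 40, 210, 300)),
    Detection("cat", 0.91, (220, 60, 400, 310)),
    Detection("duck", 0.88, (410, 180, 520, 320)),
    Detection("dog", 0.95, (530, 20, 780, 330)),
]

class_counts = Counter(d.label for d in detection_output)

print("Classifier says:", predicted_class)
for d in detection_output:
    print(f"Detector found a {d.label} ({d.score:.2f}) at {d.box}")
```

The classifier returns a single answer for the image, while the detector returns a variable-length list, so downstream code has to handle zero, one, or many objects per image.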

7.1 General object detection framework

7.1.1 Region proposals

7.1.2 Network predictions

7.1.3 Non-maximum suppression (NMS)

7.1.4 Object-detector evaluation metrics

7.2 Region-based convolutional neural networks (R-CNNs)

7.2.1 R-CNN

7.2.2 Fast R-CNN

7.2.3 Faster R-CNN

7.2.4 Recap of the R-CNN family

7.3 Single-shot detector (SSD)

7.3.1 High-level SSD architecture

7.3.2 Base network

7.3.3 Multi-scale feature layers

7.3.4 Non-maximum suppression

7.4 You only look once (YOLO)

7.4.1 How YOLOv3 works

7.4.2 YOLOv3 architecture