concept YOLO in category deep learning

This is an excerpt from Manning's book Deep Learning for Vision Systems MEAP V08 livebook.
Image classification is the most basic application of CNNs: each image contains only one object, and our task is to label the image. But if we are aiming to reach human levels of understanding, we have to add complexity to these networks so that they can recognize multiple objects and their locations in an image. To do that, we are going to build object detection systems like YOLO (You Only Look Once), SSD (Single Shot Detector), and Faster R-CNN that not only classify images but also locate and detect each object in images that contain multiple objects. These deep learning systems can look at an image, break it up into smaller regions, and label each region with a class, so that a variable number of objects in a given image can be localized and labeled. Such a task is a basic prerequisite for applications like autonomous systems.
Similar to the R-CNN family, YOLO (“You Only Look Once”) is a family of object detection networks that improved over the years through the following versions: YOLOv1, published in 2016; YOLOv2 (also known as YOLO9000), published later in 2016; and YOLOv3, published in 2018. The YOLO family is a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon et al., and is considered one of the first attempts to build a fast real-time object detector. It is one of the faster object detection algorithms available. Though it is no longer the most accurate object detection algorithm, it is a very good choice when you need real-time detection without losing too much accuracy.
The creators of YOLO took a different approach from the previous networks. YOLO does not perform a region proposal step as the R-CNNs do. Instead, it predicts over a limited number of bounding boxes by splitting the input into a grid of cells, where each cell directly predicts a bounding box and an object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction using non-maximum suppression (NMS).
Figure 7.31: YOLO splits the image into a grid, predicts objects for each grid cell, then uses NMS to finalize predictions.
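To make the two ideas above concrete, here is a minimal sketch in plain Python of how a box center maps to a grid cell and how greedy non-maximum suppression consolidates overlapping candidates. It is an illustration under stated assumptions, not YOLO's actual implementation: the function names (`responsible_cell`, `iou`, `non_max_suppression`), the `[x1, y1, x2, y2]` box format, the 0.5 overlap threshold, and the default 7×7 grid (the size used in YOLOv1) are all choices made for this example. In practice, NMS is usually applied separately for each class.

```python
def responsible_cell(cx, cy, img_w, img_h, s=7):
    """Return (row, col) of the s x s grid cell that contains the point (cx, cy)."""
    # Scale the center to grid units; clamp so points on the far edge stay in range.
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, highest score first."""
    # Visit candidates from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Two near-duplicate detections of one object, plus one separate object.
    boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))  # [0, 2]
    print(responsible_cell(224, 224, 448, 448))  # (3, 3)
```

In the usage example, the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap either of them and is kept.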