
11 Deep Convolutional Neural Network Architectures for Image Classification and Object Detection

11.1 Introduction

Figure 11.1: Is it a bird? Is it a plane? Is it Superman?

A human being shown the image in Figure 11.1 can instantly recognize the objects in it and categorize them as bird, plane, and Superman. In image classification, we want to impart this capability to computers: the ability to recognize the objects in an image and assign them to one or more known, predetermined categories. Beyond identifying the object categories, we can also identify where the objects are located in the image. An object’s location can be described by a bounding box, a rectangle whose sides are parallel to the coordinate axes. A bounding box is typically specified by four parameters, [(xtl, ytl), (xbr, ybr)], where (xtl, ytl) are the xy coordinates of the top-left corner and (xbr, ybr) are the xy coordinates of the bottom-right corner. The problem of identifying and categorizing the objects present in an image is called image classification; if, in addition, we also want to identify their locations in the image, it is called object detection.
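To make the bounding box parameterization concrete, here is a minimal Python sketch. The class name `BoundingBox`, its helper methods, and the sample coordinates are hypothetical, chosen only for illustration. The sketch also assumes the usual image coordinate convention (origin at the top-left of the image, y increasing downward), so the bottom-right corner has the larger x and y values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box stored as [(xtl, ytl), (xbr, ybr)]."""
    xtl: float  # x coordinate of the top-left corner
    ytl: float  # y coordinate of the top-left corner
    xbr: float  # x coordinate of the bottom-right corner
    ybr: float  # y coordinate of the bottom-right corner

    def __post_init__(self):
        # Assumption: image coordinates, with y growing downward, so the
        # bottom-right corner must not lie above or left of the top-left one.
        if self.xbr < self.xtl or self.ybr < self.ytl:
            raise ValueError("bottom-right corner must not precede top-left corner")

    @property
    def width(self) -> float:
        return self.xbr - self.xtl

    @property
    def height(self) -> float:
        return self.ybr - self.ytl

    @property
    def area(self) -> float:
        return self.width * self.height

    def contains(self, x: float, y: float) -> bool:
        """Return True if the point (x, y) lies inside or on the box."""
        return self.xtl <= x <= self.xbr and self.ytl <= y <= self.ybr


# Image classification yields only category labels for the image;
# object detection additionally attaches a box to each label.
# The coordinates below are made up for illustration.
classification_output = ["bird", "plane", "superman"]
detection_output = [
    ("bird", BoundingBox(xtl=40, ytl=30, xbr=120, ybr=90)),
    ("plane", BoundingBox(xtl=200, ytl=50, xbr=380, ybr=140)),
]

for label, box in detection_output:
    print(label, box.width, box.height, box.area)
```

Storing the two corners directly mirrors the four-parameter description above; width, height, and area are derived on demand rather than stored, so they cannot drift out of sync with the corners.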

11.2 Convolutional Neural Networks (CNNs) for Image Classification - LeNet

11.2.1 PyTorch: Implementing LeNet for image classification on MNIST

11.3 Towards deeper neural networks

11.3.1 VGG (Visual Geometry Group) Net

11.3.2 Inception: Network in Network paradigm

11.3.3 ResNet: Why simply stacking layers to add depth does not scale

11.3.4 PyTorch Lightning

11.4 Object Detection: A brief history

11.4.1 R-CNN

11.4.2 Fast R-CNN

11.4.3 Faster R-CNN

11.5 Faster R-CNN: A deep dive

11.5.1 Convolution Backbone

11.5.2 Region Proposal Network

11.5.3 Fast R-CNN

11.5.4 Training Faster R-CNN

11.5.5 Other Object Detection Paradigms

11.6 Chapter Summary