Welcome to part two of this book, Image Classification and Object Detection. Part one laid a foundation in neural network architectures, where we covered Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), or convnets for short. We wrapped up part one with strategies to structure your deep neural network projects and tune their hyperparameters to improve network performance. In part two, we are going to build on this foundation to develop computer vision systems that solve complex image classification and object detection problems. In chapters 3 and 4, we talked about the main components of convnets and the different hyperparameter setups, like the number of hidden layers, learning rate, and optimizer, in addition to other techniques that improve network performance, like regularization, augmentation, and dropout. In this chapter, you will see how all of these come together to build an end-to-end convolutional network. I will walk you through five of the most popular CNNs that were state-of-the-art at their time, and you will see how the authors of these networks thought about building, training, and improving them. We will start with LeNet, developed in 1998 by Yann LeCun and his team, which performed fairly well on handwritten character recognition problems. You will then see how CNN architectures evolved from LeNet to deeper convnets like AlexNet and VGGNet, all the way to more advanced and much deeper networks like Inception and ResNet, developed in 2014 and 2015 respectively. For each CNN architecture you will learn the following:
To get the most out of this chapter, I encourage you to read the research papers that are linked in each section before you read my explanation. What you have learned in part one of this book fully equips you to start reading research papers written by pioneers in the AI field. Reading and implementing research papers is by far one of the most valuable skills you will build from reading this book. Are you ready? Let’s get started!
5.1 LeNet-5
In 1998, LeCun et al. introduced in their paper “
Gradient-Based Learning Applied to Document Recognition” a pioneering convolutional neural network called
LeNet-5. The LeNet-5 architecture is straightforward and you have seen all of its components in the previous chapters of this book. It is composed of 5 weight layers, hence the name LeNet-5: 3 convolutional layers + 2 fully connected layers.
What are weight layers?
We refer to the convolutional and fully-connected layers as weight layers because they contain trainable weights, as opposed to pooling layers, which don’t contain any weights. The common convention is to use the number of weight layers to describe the depth of the network. For example, AlexNet (explained next) is said to be 8 layers deep because it contains 5 CONV + 3 FC layers. The reason we care more about weight layers is mainly because they reflect the model’s computational complexity.
Where C is the CONV layer, S is the subsampling or POOL layer, and FC is the fully connected layer. The building components of the LeNet architecture are not new to you (they were new back in 1998). You have already learned about the CONV, POOL, and FC layers in chapter 3. Notice that Yann LeCun and his team used tanh as the activation function instead of the now state-of-the-art ReLU. This is because back in 1998, ReLU had not yet been used in the context of deep learning, and it was more common to use tanh or sigmoid as the activation function in the hidden layers. Without further ado, let’s implement LeNet-5 in Keras.
5.1.2 LeNet-5 implementation in Keras
To implement LeNet-5 in Keras, read the
original paper and follow the architecture information from pages 6, 7 and 8. Here are the main takeaways to build the LeNet-5 network:
LeNet-5 is a small neural network by today’s standards. It has 61,706 parameters, compared to the millions of parameters in the more modern networks you will see later in this chapter.
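Here is a minimal sketch of what the LeNet-5 implementation might look like in Keras, assuming 32x32x1 inputs (MNIST images padded to 32x32), tanh activations, average pooling for the subsampling layers, and a softmax output in place of the paper’s original RBF output layer; see the chapter notebook for the full version:
# A sketch of LeNet-5 in Keras
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense
model = Sequential()
# C1: 6 filters of size 5x5
model.add(Conv2D(6, kernel_size=(5, 5), activation='tanh', input_shape=(32, 32, 1)))
# S2: 2x2 average pooling (subsampling)
model.add(AveragePooling2D(pool_size=(2, 2)))
# C3: 16 filters of size 5x5
model.add(Conv2D(16, kernel_size=(5, 5), activation='tanh'))
# S4: 2x2 average pooling (subsampling)
model.add(AveragePooling2D(pool_size=(2, 2)))
# C5: 120 filters of size 5x5
model.add(Conv2D(120, kernel_size=(5, 5), activation='tanh'))
model.add(Flatten())
# F6: fully connected layer with 84 units
model.add(Dense(84, activation='tanh'))
# output layer: 10 classes with a softmax
model.add(Dense(10, activation='softmax'))
# print the model summary: ~61,706 trainable parameters
model.summary()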
5.1.3 Set up the learning hyperparameters
The authors used a scheduled learning rate decay where the value of the learning rate decreases according to the following schedule: 0.0005 for the first two epochs, 0.0002 for the next three epochs, 0.00005 for the next four, then 0.00001 thereafter. In their paper, the authors trained their network for 20 epochs. Let’s build an lr_schedule function with the above schedule. The function takes an integer epoch number as an argument and returns the learning rate (lr).
def lr_schedule(epoch):
    # initiate the learning rate with value = 0.0005
    lr = 5e-4
    # 0.0005 for the first two epochs, 0.0002 for the next three epochs,
    # 0.00005 for the next four, then 0.00001 thereafter
    # (check the later epoch boundaries first so the conditions don't shadow each other)
    if epoch > 9:
        lr = 1e-5
    elif epoch > 5:
        lr = 5e-5
    elif epoch > 2:
        lr = 2e-4
    return lr
We will then use the lr_schedule function in the code snippet below to compile the model:
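One possible way to wire this up is sketched below (the optimizer choice, batch size, and the X_train/y_train variable names are assumptions; the notebook may differ). The lr_schedule function is passed to a LearningRateScheduler callback so Keras updates the learning rate at the start of every epoch:
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD
# use the schedule above through a Keras callback
lr_scheduler = LearningRateScheduler(lr_schedule)
# compile the model; the optimizer starts at the first scheduled learning rate
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=lr_schedule(0)), metrics=['accuracy'])
# train for 20 epochs as in the paper (X_train/y_train are the preprocessed MNIST arrays)
model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_test, y_test), callbacks=[lr_scheduler])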
When you train LeNet-5 on the MNIST dataset you will get above 99% accuracy (see the code notebook attached to this chapter at
www.computervisionbook.com). Try re-running this experiment with the ‘relu’ activation function in the hidden layers and observe the difference in network performance.
I recommend starting with the AlexNet paper, followed by the VGGNet paper, explained in the following sections, and then the LeNet paper, because it is a bit harder to read but it is a good classic one once you go over the other ones.
5.2 AlexNet
We saw how LeNet performed very well on the MNIST dataset. But it turns out that MNIST is a fairly simple dataset: it contains grayscale images (1 channel) classified into only 10 classes, which makes it an easier challenge. The main motivation behind AlexNet was to build a deeper network that can learn more complex functions. AlexNet was the winner of the ILSVRC image classification competition in 2012. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created the neural network architecture called ‘AlexNet’ in their paper “
ImageNet Classification with Deep Convolutional Neural Networks”. They trained their network on 1.2 million high-resolution images from the ImageNet dataset, classified into 1,000 different classes. AlexNet was state-of-the-art at its time because it was the first truly “deep” network (back then), and it opened the door for the computer vision community to seriously consider convolutional networks in their applications. We will explain deeper networks like VGGNet and ResNet later in this chapter, but it is good to see how convnets evolved and which drawbacks of AlexNet were the main motivation for the later networks. The AlexNet architecture is shown in the figure below:
Figure 5.4
As you can see in the diagram above, AlexNet has a lot of similarities to LeNet, but it is much deeper (more hidden layers) and bigger (more filters per layer). Both share the same building blocks: a series of CONV + POOL layers stacked on top of each other, followed by FC layers and a softmax. We’ve seen that LeNet has around 61 thousand parameters, whereas AlexNet has about 60 million parameters and 650,000 neurons, which gives it a larger learning capacity to understand more complex features. This allowed AlexNet to achieve remarkable performance in the ILSVRC image classification competition in 2012.
What is ImageNet and ILSVRC?
ImageNet is a large visual database designed for use in visual object recognition software research. It is aimed at labeling and categorizing images into almost 22,000 categories based on a defined set of words and phrases. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowdsourcing tool. At the time of this writing, there are over 14 million images in the ImageNet project. To organize such a massive amount of data, the creators of ImageNet followed the WordNet hierarchy, where each meaningful word/phrase inside WordNet is called a “synonym set” or “synset” for short. Within the ImageNet project, images are organized according to these synsets, with the goal being to have 1,000+ images per synset.
AlexNet consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. You can represent the AlexNet architecture in text as follows:
Before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas. But AlexNet was the milestone that convinced much of the computer vision community to take a serious look at deep learning and demonstrated that deep learning really works in computer vision. Compared to previous CNNs (like LeNet), AlexNet presented some novel features that had not been used in previous architectures. You are already familiar with all of them from the previous chapters of this book, so it should be quick for us to go through them here.
5.2.2.1 ReLU activation function:
AlexNet uses ReLU (Rectified Linear Unit) for the non-linear part, instead of the tanh or sigmoid functions that were the earlier standard for traditional neural networks (like LeNet). ReLU was used in the hidden layers of the AlexNet architecture because it trains much faster. This is because the derivative of the sigmoid function becomes very small in the saturating region, and therefore the updates applied to the weights almost vanish. This phenomenon is called the vanishing gradient problem. ReLU is represented by the equation f(x) = max(0, x) and is discussed in detail in chapter 2.
We will talk more about the vanishing gradient phenomenon later in this chapter when we talk about the ResNet architecture.
5.2.2.2 Dropout layer:
as explained in chapter 3, dropout layers are used to avoid the neural network overfitting. The neurons which are “dropped out” do not contribute to the forward pass and do not participate in backpropagation. This means that every time an input is presented, the neural network samples a different architecture, but all these architectures share the same weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. The authors used dropout with a probability = 0.5 in the two fully-connected layers.
5.2.2.3. Data augmentation:
one popular and very effective approach to avoid overfitting is to artificially enlarge the dataset using label-preserving transformations. This happens by generating new instances of the training images with transformations such as rotation, flipping, and scaling. Data augmentation is explained in detail in chapter 4.
5.2.2.4. Local response normalization:
in AlexNet, local response normalization is used. It is different from the batch normalization technique (explained in chapter 4). Normalization helps to speed up the convergence. Nowadays, batch normalization (BN) is used instead of using local response normalization and we will be using BN in our implementation in this chapter.
5.2.2.5. Weight regularization:
the authors used a weight decay of 0.0005. Weight decay is another term for the L2 regularization technique explained in chapter 4. It is an approach to reduce the overfitting of deep learning models on the training data and allow them to generalize better to new data.
The lambda value is the weight decay hyperparameter that you can tune. If you still see overfitting, increase the lambda value to reduce it. In this case, the authors found that a small decay value of 0.0005 was good enough for the model to learn.
5.2.2.6. Training on multiple GPUs:
the authors used a GTX 580 GPU that has only 3GB of memory. It was state-of-the-art at the time but not large enough to train on the 1.2 million training examples in their dataset. Therefore, they developed a complicated way to spread their network across two GPUs: many of the layers were split across two different GPUs, with a thoughtful scheme for when the two GPUs would communicate with each other. You don’t need to worry about these details nowadays because there are far more advanced ways to train your deep networks on distributed GPUs, which we will discuss later in this book.
5.2.3 AlexNet implementation in Keras
Okay, now that you’ve learned the basic components of AlexNet and the novel features, let’s apply all these together to build the AlexNet neural network. I suggest that you read the architecture description in page 4 in the
original paper and follow along with the next section. As depicted in the figure below, the network contains eight weight layers: the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels. The AlexNet input starts with 227x227x3 images. If you read the paper, you will notice that it refers to an input volume of 224x224x3, but the numbers only work out for 227x227x3 images; this is likely a typo in the paper.
Note that all CONV layers are followed by a batch normalization layer and all hidden layers are followed by ReLU activations. Now, let’s put that in code to build the AlexNet architecture:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, BatchNormalization, Activation, Flatten, Dense, Dropout
from keras.regularizers import l2

# Instantiate an empty sequential model
model = Sequential()
# 1st layer (conv + pool + batchnorm)
model.add(Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), padding='valid',
input_shape=(227,227,3)))
# the activation function can be added as its own layer or
# within the Conv2D call as we did in previous implementations
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
model.add(BatchNormalization())
# 2nd layer (conv + pool + batchnorm)
model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(1,1)))
model.add(BatchNormalization())
# layer 3 (conv + batchnorm) <--- note that the authors did not add a POOL layer here
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
# layer 4 (conv + batchnorm) <--- similar to layer 3
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
# layer 5 (conv + batchnorm)
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
# Flatten the CNN output to feed it with fully connected layers
model.add(Flatten())
# layer 6 (Dense layer + dropout)
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
# layer 7 (Dense layers)
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
# layer 8 (softmax output layer)
model.add(Dense(units = 1000, activation = 'softmax'))
# print the model summary
model.summary()
Model summary
When you print the model summary, you will see that the total number of parameters is about 62 million, as follows:
Figure 5.7
A note on LeNet and AlexNet architectures
Both LeNet and AlexNet have so many hyperparameters to tune. The authors had to go through so many experiments to set the kernel_size, strides, and padding for each layer, which makes these networks harder to understand and manage. VGGNet (explained next) solves this problem with a very simple and uniform architecture.
5.2.4 Set up the learning hyperparameters
AlexNet was trained for 90 epochs, which took six days on two Nvidia GeForce GTX 580 GPUs running in parallel. This is why you will see the network split into two pipelines in the original paper. The authors started with an initial learning rate of 0.01 and a momentum of 0.9. The learning rate is then divided by 10 when the validation error stops improving.
import numpy as np
import keras
from keras.callbacks import ReduceLROnPlateau
# reduce the learning rate when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1))
# set the SGD optimizer with lr of 0.01 and momentum of 0.9
optimizer = keras.optimizers.SGD(lr = 0.01, momentum = 0.9)
# compile the model
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
# train the model
# call the reduce_lr value using callbacks in the training method
model.fit(X_train, y_train, batch_size=128, epochs=90, validation_data=(X_test, y_test),
verbose=2, callbacks=[reduce_lr])
5.2.5 AlexNet performance
AlexNet significantly outperformed all the prior competitors in the ILSVRC challenge and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry of that year, which used traditional classifiers. This huge improvement in performance attracted the computer vision community’s attention to the potential of convolutional networks to solve complex vision problems, and led to more advanced CNN architectures, as you will see in the following sections of this chapter.
5.3 VGGNet
VGGNet was developed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at Oxford University, hence the name VGG. It was introduced in their paper “
Very Deep Convolutional Networks for Large-Scale Image Recognition”. The building components of VGGNet are exactly the same as LeNet and AlexNet, except that it is an even deeper network with more convolutional, pooling, and dense layers. Other than that, no new components are introduced. VGGNet, also known as VGG-16, consists of 16 weight layers: 13 convolutional layers + 3 fully-connected layers. Its uniform architecture makes it very appealing in the deep learning community because it is very easy to understand.
5.3.1 Novel features of VGGNet
We’ve seen how challenging it can be to set up the CNN hyperparameters like kernel size, padding, and strides. VGGNet’s novel concept is a simple architecture that contains uniform components (CONV and POOL layers). It improves over AlexNet by replacing the large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3x3 filters stacked one after another. The architecture is composed of a series of uniform CONV building blocks followed by a unified POOL layer where:
The authors of VGGNet decided to use smaller 3x3 kernels to allow the network to extract finer-level features of the image compared to AlexNet’s large 11x11 and 5x5 kernels. The idea is that, for a given convolutional receptive field, multiple stacked smaller kernels are better than one larger kernel, because the multiple non-linear layers increase the depth of the network, which enables it to learn more complex features at a lower cost (fewer learnable parameters). For example, in their experiments the authors noticed that a stack of two 3x3 conv. layers (without spatial pooling in between) has an effective receptive field of 5x5, and three 3x3 conv. layers have the effect of a 7x7 receptive field. So by using 3x3 convolutions with higher depth you get the benefit of using more nonlinear rectification layers (ReLU), which makes the decision function more discriminative. Secondly, this decreases the number of training parameters: when you use a three-layer 3x3 conv. stack with C channels, the stack is parameterised by
3 × (3²C²) = 27C² weights, compared to a single 7x7 conv. layer, which requires 7²C² = 49C² weights, which is 81% more parameters.
This unified configuration of the CONV and POOL components simplifies the neural network architecture, which makes it very easy to understand and implement. The VGGNet architecture is developed by stacking 3x3 convolutional layers, with 2x2 pooling layers inserted after several CONV layers. This is followed by the traditional classifier, composed of some fully-connected layers and a softmax, as depicted in the figure below:
Figure 5.9
5.3.2 VGGNet Configurations
The authors created several configurations for the VGGNet architecture, as you see in the table below. All configurations follow the same generic design. Configurations D and E are the most commonly used and are referred to as VGG-16 and VGG-19, after their number of weight layers. Each block contains a series of 3x3 convolutional layers with a similar hyperparameter configuration, followed by a 2x2 pooling layer.
Figure 5.10
In the table below you will see the number of learning parameters in millions for each configuration. VGG-16 yields ~138 million parameters and VGG-19 is a deeper version of VGGNet that has more than 144 million parameters. VGG-16 is more commonly used because it performs almost as well as VGG-19 but with fewer parameters.
Figure 5.11
5.3.3 VGG-16 in Keras
Configurations D (VGG-16) and E (VGG-19) are the most commonly used configurations because they are deeper networks that can learn more complex functions. So, in this chapter we will implement configuration D of the VGGNet that has 16 weight layers. VGG-19 (Configuration E) can be similarly implemented by just adding a fourth CONV layer to the third, fourth, and fifth blocks as you can see in the above table. You can see the notebooks attached to this chapter for a full implementation of both VGG-16 and VGG-19 at
www.computervisionbook.com. Note that the authors used the following regularization techniques to avoid overfitting:
L2 regularization with a weight decay of 5x10-4
Dropout with a 0.5 ratio for the first two fully-connected layers
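As a partial sketch of the uniform pattern (only the first two of the five convolutional blocks are shown here; the full 16-layer model with the regularizers above is in the notebook), the 3x3 CONV + 2x2 POOL stacking looks like this in Keras:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D
model = Sequential()
# block 1: two 3x3 CONV layers with 64 filters, followed by a 2x2 POOL
model.add(Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPool2D((2, 2), strides=(2, 2)))
# block 2: two 3x3 CONV layers with 128 filters, followed by a 2x2 POOL
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPool2D((2, 2), strides=(2, 2)))
# blocks 3, 4, and 5 follow the same pattern with 256, 512, and 512 filters,
# then the classifier: Flatten + two Dense(4096, relu) layers + Dense(1000, softmax)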
When you print the model summary you will see the number of total parameters ~ 138 million as follows:
Figure 5.12
5.3.4 Learning hyperparameters
The authors followed a training procedure similar to AlexNet’s. Namely, the training is carried out using mini-batch gradient descent (SGD) with momentum = 0.9. The learning rate was initially set to 0.01 and then decreased by a factor of 10 when the validation set accuracy stopped improving.
5.3.5 VGGNet performance
VGG-16 achieved a top-5 error rate of 8.1% on ImageNet, compared to 15.3% achieved by AlexNet. VGG-19 did even better, achieving a top-5 error rate of ~7.4%. It is worth noting that, in spite of the larger number of parameters and the greater depth of VGGNet compared to AlexNet, VGGNet required fewer epochs to converge due to the implicit regularisation imposed by its greater depth and smaller convolutional filter sizes.
5.4 Inception and GoogLeNet
The Inception network came to the world in 2014 when a group of researchers at Google published their paper “
Going Deeper with Convolutions”. The main hallmark of this architecture is building a deeper neural network while improving the utilization of the computing resources inside the network. One particular incarnation of the Inception network is called GoogLeNet and was used in their submission for ILSVRC 2014. It is a 22-layer-deep network, deeper than VGGNet, with 12 times fewer parameters (from ~138 million down to ~13 million), while achieving significantly more accurate results. The network used a CNN inspired by the classical networks (AlexNet and VGGNet) but implemented a novel element, dubbed the
Inception Module.
5.4.1 Novel features of Inception
The authors of the Inception network took a different approach when designing their network architecture. As we’ve seen in the previous networks, there are some architectural decisions that you need to make for each layer when you are designing your network. Decisions like:
Configuring the kernel size and positioning the pool layers are decisions that you mostly need to make by trial and error and experimentation to get optimal results. Inception says: instead of choosing a desired filter size for a CONV layer and deciding where to place the pooling layers, let’s apply all of them together in one block and call it the “
Inception Module”. Instead of stacking layers on top of each other as in classical architectures, the authors suggest that we create an “inception module” that consists of several convolutional layers with different kernel sizes. The architecture is then developed by stacking inception modules on top of each other. Let’s take a look at how classical convolutional networks are architected vs. the Inception network:
Figure 5.13
From the above diagram, you can observe the following:
In classical architectures like LeNet, AlexNet, and VGGNet, we stack convolutional and pooling layers on top of each other to build the feature extractors. At the end, we add the dense FC layers to build the classifiers.
We’ve been treating the inception modules as black boxes to understand the bigger picture of the inception architecture. Now, we will unpack the inception module to understand how it works.
5.4.2 Inception module - naive version
The Inception module is a combination of four layers:
1×1 convolutional layer,
3×3 convolutional layer,
5×5 convolutional layer, and
3x3 max-pooling layer
The outputs of these layers are then concatenated into a single output volume forming the input of the next stage. The naive representation of the inception module is represented in the figure below:
Figure 5.14
The diagram might look a little overwhelming but the idea is simple to understand. Let’s follow along with this example:
5.4.3 Inception module with dimensionality reduction
The naive representation of the inception module that we just saw has a big computational cost problem that comes with processing larger filters like the 5x5 convolutional layer. To get a better sense of the compute problem with the naive representation, let’s calculate the number of operations that will be performed for the 5x5 CONV layer in the previous example. The input volume of 32x32x200 is fed to a 5x5 CONV layer with 32 filters, each spanning all 200 input channels. This means the total number of multiplications the computer needs to perform is 32x32x200 multiplied by 5x5x32, which is more than 163 million operations. While modern computers can handle this many operations, it is still pretty expensive. This is where the dimensionality reduction layers become very useful.
The 1x1 convolutional layer can reduce the operational cost of 163 million operations to about a tenth of that. That is why it is given the name “
reduce layer”. The idea here is to add a 1x1 CONV layer before the bigger kernels like the 3x3 and 5x5 CONV to reduce their depth which in turn will reduce the number of operations. Let’s look at the example below: Suppose we have an input dimension volume of 32 x 32 x 200. We then add a 1x1 CONV with depth = 16. This will reduce the dimension volume from 200 to 16 channels. We can then apply the 5x5 CONV on the output that has much less depth.
Figure 5.15
Notice that the input of 32x32x200 is processed through the two conv layers and outputs a volume of dimensions 32x32x32, which is the same dimension that we produced before without applying the dimensionality reduction layer. But what we’ve done here is, instead of processing the 5x5 conv layer on the entire 200 channels of the input volume, we take this huge volume and shrink its representation to a much smaller intermediate volume that has only 16 channels. Now, let’s look at the computational cost involved in this operation and compare it to the 163 million multiplications that we got before applying the reduce layer. Computation = operations in the 1x1 convolution + operations in the 5x5 convolution
= 32x32x200 multiplied by 1x1x16 + 32x32x16 multiplied by 5x5x32
= 3.2 million + 13.1 million
The total number of multiplications in this operation is 16.3 million, about a tenth of the 163 million multiplications that we calculated earlier without the reduce layers.
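If you want to double-check these numbers, a quick back-of-the-envelope calculation in plain Python (nothing specific to the book’s code) reproduces them:
# multiply counts for the 32x32x200 example above
naive_5x5  = (32 * 32 * 32) * (5 * 5 * 200)   # 5x5 CONV with 32 filters applied directly to 200 channels
reduce_1x1 = (32 * 32 * 16) * (1 * 1 * 200)   # 1x1 reduce layer shrinking the depth to 16 channels
then_5x5   = (32 * 32 * 32) * (5 * 5 * 16)    # 5x5 CONV with 32 filters applied to the reduced volume
print(naive_5x5)              # 163,840,000  (~163 million)
print(reduce_1x1 + then_5x5)  # 16,384,000   (~16.3 million)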
What is the 1x1 convolutional layer?
The idea of the 1x1 CONV layer is that it preserves the spatial dimensions (H & W) of the input volume but changes the number of channels of the volume (depth).
Figure 5.16
The 1x1 convolutional layers are also known as “bottleneck layers”. The analogy is made because the bottleneck is the smallest part of the bottle, and the reduce layers reduce the dimensionality of the network, making it look like a bottleneck.
Figure 5.17
What is the impact of dimensionality reduction on the network performance? Now you might be wondering: does shrinking down the representation size so dramatically hurt the performance of the neural network? The authors ran experiments and found that as long as you apply this reduce layer in moderation, you can shrink the representation size significantly without hurting the performance, and save a lot of computation. Now let’s put the reduce layers in action and build the new inception module with dimensionality reduction. To do that, we will keep the same concept of concatenating all four layers that we had in the naive representation. We will add a 1x1 convolutional reduce layer before the 3x3 and 5x5 convolutions to reduce their computational cost. We will also add a 1x1 conv after the 3x3 max-pooling layer, because pooling layers don’t reduce the depth of their inputs, so we need to apply the reduce layer to their output before we do the concatenation. See the diagram below:
Figure 5.18
We add dimensionality reduction prior to bigger convolutional layers to allow for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
We then turned to the problem of computational cost that comes with using large-size filters. Here we use a 1x1 convolutional layer, called the reduce layer, that reduces the computational cost significantly. We add the reduce layers before the 3x3 and 5x5 convolutional layers and after the max-pooling layer to create the inception module with dimensionality reduction.
5.4.4 Inception architecture
Okay, now that we understand the components of the inception module, we are ready to build the Inception network architecture. Here, we are going to use the dimensionality reduction representation of the inception module and simply stack the modules on top of each other, with a 3x3 pooling layer in between for downsampling, as you can see in the figure below.
Figure 5.19
In the figure above, we stacked two inception modules with a pooling layer in between. We can stack as many inception modules as we want to build a very deep convolutional network. In the original paper, the authors built a specific incarnation of the inception module and called it
GoogLeNet. They used this network in their submission for the ILSVRC 2014 competition. The GoogLeNet architecture is depicted in the diagram below:
Figure 5.20 Full GoogLeNet Model
As you can see in the diagram above, GoogLeNet uses a stack of a total of 9 inception blocks, with a max-pooling layer after every few blocks to reduce the dimensionality. To simplify this implementation, we are going to break down the GoogLeNet architecture into three parts, A, B, and C:
Part A: identical to the AlexNet and LeNet architectures, where it contains a series of CONV and POOL layers
Now, let’s implement the GoogLeNet architecture in Keras. First we build the inception module function to use it in our architecture:
Figure 5.21
Notice that the inception module takes the features from the previous module as an input, passes them through 4 routes, concatenates the outputs of all 4 routes along the depth, and then passes the concatenated output to the next module. The four routes are as follows:
1x1 conv
1x1 conv + 3x3 conv
1x1 conv + 5x5 conv
3x3 pool + 1x1 conv
Now, let’s build the
inception_module function. The function takes the number of filters of each convolutional layer as an argument and returns the concatenated output.
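Here is one way the inception_module function could look (a sketch consistent with how it is called later in this section; the kernel_init and bias_init initializers are assumptions defined here for completeness, not taken from the paper):
from keras import initializers
from keras.layers import Conv2D, MaxPool2D, concatenate

# assumed initializers (any reasonable choice works; these are common defaults)
kernel_init = initializers.glorot_uniform()
bias_init = initializers.Constant(value=0.2)

def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3,
                     filters_5x5_reduce, filters_5x5, filters_pool_proj, name=None):
    # route 1: 1x1 CONV
    conv_1x1 = Conv2D(filters_1x1, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    # route 2: 1x1 CONV (reduce) + 3x3 CONV
    conv_3x3 = Conv2D(filters_3x3_reduce, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, (3, 3), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_3x3)
    # route 3: 1x1 CONV (reduce) + 5x5 CONV
    conv_5x5 = Conv2D(filters_5x5_reduce, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, (5, 5), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_5x5)
    # route 4: 3x3 max-pooling + 1x1 CONV (pool projection)
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu',
                       kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)
    # concatenate the four routes along the depth (channels) axis
    return concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)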
Now that we have the
inception_module function ready, let’s build the GoogLeNet architecture that we explained in the previous diagram. To get the values of the inception_module arguments, we are going to go through the table below, which represents the hyperparameters setup as implemented by the authors in the original “Going Deeper with Convolutions” paper.
Figure 5.22
Note that “#3×3 reduce” and “#5×5 reduce” in the table above represent the 1×1 filters in the reduction layers that are used before the 3×3 and 5×5 convolutions. Now, let’s go through the implementations of parts A, B, and C.
Part A: build the bottom part of the network
Let’s build the bottom part of the network. This part consists of: 7x7 CONV > 3x3 POOL > 1x1 CONV > 3x3 CONV > 3x3 POOL as you can see in the diagram below.
Figure 5.23
LocalResponseNorm layer: similar to AlexNet, a local response normalization is used. It is a normalization technique that helps speed up the convergence. Nowadays, batch normalization (BN) is used instead of using local response normalization and we will be using BN in our implementation in this chapter.
# input layer with size = 224x224x3
input_layer = Input(shape=(224, 224, 3))
x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7/2', kernel_initializer=kernel_init, bias_initializer=bias_init)(input_layer)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
x = BatchNormalization()(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)
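Part B: stack the inception modules following the table in Figure 5.22. The calls below pick up at module 4a; according to the paper’s table, modules 3a and 3b (followed by a 3x3/2 max-pool) come first, and a sketch of them would look like this:
x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128, filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32, name='inception_3a')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192, filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64, name='inception_3b')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_3_3x3/2')(x)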
x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208, filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64, name='inception_4a')
x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4b')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4c')
x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288, filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64, name='inception_4d')
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_4e')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)
Now, let’s create modules 5a and 5b:
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_5a')
x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384, filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128, name='inception_5b')
Part C: the classifier part
In their experiments, the authors found that adding a 7x7 average pooling layer improved the top-1 accuracy by about 0.6%. They then added a dropout layer with a 40% probability to reduce overfitting.
x = AveragePooling2D(pool_size=(7,7), strides=1, padding='valid')(x)
x = Dropout(0.4)(x)
x = Dense(10, activation='softmax', name='output')(x)
5.4.6 Learning hyperparameters
The authors used a stochastic gradient descent optimizer with 0.9 momentum. They also implemented a fixed learning rate decay schedule of 4% every 8 epochs.
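A sketch of that schedule in Keras could look like the following (the initial learning rate of 0.01, the batch size, the number of epochs, and the X_train/y_train names are assumptions, since the paper does not specify them):
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD

initial_lr = 0.01   # assumed starting value
# decrease the learning rate by 4% every 8 epochs
def lr_decay(epoch):
    return initial_lr * (0.96 ** (epoch // 8))

# compile with SGD and 0.9 momentum as described above
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=initial_lr, momentum=0.9), metrics=['accuracy'])
# X_train/y_train are placeholders for your training data
model.fit(X_train, y_train, batch_size=256, epochs=80, callbacks=[LearningRateScheduler(lr_decay)])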
GoogLeNet was the winner of the ILSVRC 2014 competition. It achieved a top-5 error rate of 6.67%, which was very close to human-level performance and much better than previous CNNs like AlexNet and VGGNet.
5.5 ResNet
Residual Neural Network (ResNet) was developed in 2015 by Kaiming He et al. from the Microsoft Research team, in their paper “
Deep Residual Learning for Image Recognition”. They introduced a novel architecture with “skip connections” called
residual module. The network also features heavy batch normalization for the hidden layers. This technique allowed the authors to train very deep neural networks with 50, 101, and 152 weight layers while still having lower complexity than smaller networks like VGGNet (19 layers). ResNet was able to achieve a top-5 error rate of 3.57% in ILSVRC 2015, beating the performance of all prior convnets.
5.5.1 Novel features of ResNet
Looking at how neural network architectures evolved from LeNet to AlexNet, VGGNet, and Inception, you might have noticed that the deeper the network, the larger its learning capacity, and the better it extracts features from images. This mainly happens because very deep networks are able to represent very complex functions, which allows the network to learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). Earlier in this chapter, we saw deep neural networks like VGG-19, which contains 19 layers, and GoogLeNet, which contains 22 layers. Both have performed very well in the ImageNet challenge. But can we build even deeper networks? We learned in chapter 4 that one downside of adding too many layers is that it makes the network more prone to overfit the training data. This is not a big problem, because there are many regularization techniques from chapter 4 that we can use to avoid overfitting, like dropout, L2 regularization, and batch normalization. So, if we take care of the overfitting problem, wouldn’t we want to build very deep networks that are 50, 100, or even 150 layers deep? The answer is yes, we definitely should try to build very deep neural networks. There is only one other problem we need to fix to unblock the capability of building super deep networks: a phenomenon called vanishing gradients.
To solve the vanishing gradient problem, the authors created a shortcut that allows the gradient to be directly backpropagated to earlier layers. These shortcuts are called “skip connections”. The skip connections flow information from earlier layers in the network to later layers, creating an alternate shortcut path for the gradient to flow through. Another important benefit of the skip connections is that they allow the model to learn an identity function, which ensures that the layer will perform at least as well as the previous layer.
Figure 5.24
The figure on the left is the traditional stacking of convolutional layers one after the other. On the right, we still stack convolutional layers as before, but we now also add the original input to the output of the convolution block. This is called a skip connection. We then add both signals: the skip connection + the main path. Note that the shortcut arrow is pointing to the end of the second convolutional layer, not after it. The reason is that you do the addition of both paths before you apply the activation function (‘relu’) of this layer. It goes like this:
Figure 5.25
As you can see in the figure above, the X signal is passed along the shortcut path and then added to the main path f(x). Then, we apply the relu activation to f(x) + x to produce the output signal: relu( f(x) + x ). The code implementation of the skip connection is straightforward. Look at the code snippet below:
from keras.layers import Conv2D, Activation, Add

# First, store the value of the shortcut to be equal to the input X
X_shortcut = X
# Then perform the main path operations: CONV+ReLU + CONV
X = Conv2D(filters = F1, kernel_size = (3, 3), strides = (1,1))(X)
X = Activation('relu')(X)
X = Conv2D(filters = F1, kernel_size = (3, 3), strides = (1,1))(X)
# Then add the both paths together
X = Add()([X, X_shortcut])
# and finally, apply the relu activation function
X = Activation('relu')(X)
This combination of the skip connection and convolutional layers is called the
residual block. Similar to the Inception network, ResNet is composed of a series of building blocks that are stacked on top of each other. These building blocks are called residual blocks.
Figure 5.26
From the above diagram, you can observe the following:
Feature extractor: to build the feature extractor part of the ResNet, we start with a CONV + POOL layer, then stack residual blocks on top of each other to build the network. When we are designing our ResNet network, we can add as many residual blocks as we want to build even deeper networks.
Classifier: the classification part is still the same as we learned in other networks: fully-connected layers followed by a softmax.
Now that you know what a skip connection is and you are familiar with the high-level architecture of ResNets, let’s unpack the residual blocks to understand how they work.
Main path: a series of convolutions and activations. The main path consists of 3 convolutional layers with relu activations. We will also add batch normalization to each CONV layer to reduce overfitting and speed up training. The main path architecture looks like this: [CONV > BN > ReLU] x 3.
Figure 5.27
Similar to what we explained on the previous page, the shortcut path is added to the main path right before the activation function of the last CONV layer. Then we apply the relu function after adding the two paths. Notice that there are no pooling layers in the residual block. Instead, the authors of ResNet decided to downsample dimensions using bottleneck 1x1 convolutional layers, similar to the Inception network. So, each residual block starts with a 1x1 CONV to downsample the input dimension volume, followed by a 3x3 CONV and another 1x1 CONV to downsample the output. This is a good technique to keep control of the volume dimensions across many layers. This configuration of the residual block is called the bottleneck residual block. Now, you are ready to start building your ResNet in code; only one thing is left. When you are stacking residual blocks on top of each other, the volume dimensions change from one block to another. And as you might recall from the matrices introduction in chapter 2, to be able to perform matrix addition, the matrices must have the same dimensions. To fix this problem, we need to downsample the shortcut path as well before merging both paths. We do that by adding a bottleneck layer (1x1 CONV + BN) to the shortcut path, as you see in the diagram below. This is called the reduce shortcut.
Figure 5.28
Before we jump into the code implementation, let’s recap what we discussed about residual blocks:
Regular shortcut: in which we simply add the input from the shortcut path to the main path
Reduce shortcut: in which we add a CONV layer in the shortcut path before merging with the main path
When we are implementing the ResNet network, we will use both regular and reduce shortcuts. This will get clearer when you see the full implementation soon. For now, we will implement a bottleneck_residual_block function that takes a boolean argument reduce. When reduce = True, we use the reduce shortcut; otherwise, it implements the regular shortcut. The function takes the following arguments:
X: the input tensor
kernel_size: the kernel size of the middle convolutional layer of the main path
filters: a list of three integers defining the number of filters in each of the three CONV layers
reduce: a boolean that selects the reduce shortcut when True
s: the stride used by the first CONV layer and the shortcut CONV layer when reduce = True
from keras.layers import Conv2D, BatchNormalization, Activation, Add

def bottleneck_residual_block(X, kernel_size, filters, reduce=False, s=2):
    # unpack the filters list to retrieve the number of filters of each CONV layer
    F1, F2, F3 = filters

    # Save the input value to use it later to add back to the main path
    X_shortcut = X

    if reduce:
        # if we are to reduce the spatial size, apply a 1x1 CONV layer to the shortcut path
        # to do that, we need both CONV layers to have similar strides
        X_shortcut = Conv2D(filters = F3, kernel_size = (1, 1), strides = (s,s))(X_shortcut)
        X_shortcut = BatchNormalization(axis = 3)(X_shortcut)

        # if reduce, we also need to set the strides of the first conv to match the shortcut strides
        X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (s,s), padding = 'valid')(X)
        X = BatchNormalization(axis = 3)(X)
        X = Activation('relu')(X)
    else:
        # First component of main path
        X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid')(X)
        X = BatchNormalization(axis = 3)(X)
        X = Activation('relu')(X)

    # Second component of main path
    X = Conv2D(filters = F2, kernel_size = kernel_size, strides = (1,1), padding = 'same')(X)
    X = BatchNormalization(axis = 3)(X)
    X = Activation('relu')(X)

    # Third component of main path
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides = (1,1), padding = 'valid')(X)
    X = BatchNormalization(axis = 3)(X)

    # Final step: add the shortcut value to the main path, and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X
5.5.3 ResNet implementation in Keras
Alright, we’ve learned a lot about residual blocks so far. Let’s add these blocks on top of each other to build the full ResNet architecture. In this chapter, we will implement ResNet50. It is a version of the ResNet architecture that contains 50 weight layers, hence the name ResNet50. You can use the same approach to develop ResNet with 18, 34, 101, and 152 layers by following the architecture in the table below from the
Deep Residual Learning for Image Recognition paper.
Figure 5.29
We know from the previous section that each residual block contains 3 CONV layers, so we can now compute the total number of weight layers inside the ResNet50 network as follows: 1 CONV layer in stage 1, plus 3 blocks x 3 = 9 layers in stage 2, 4 blocks x 3 = 12 in stage 3, 6 blocks x 3 = 18 in stage 4, 3 blocks x 3 = 9 in stage 5, plus the final FC layer. When you sum all these layers together (1 + 9 + 12 + 18 + 9 + 1), you get a total of 50 weight layers, which describes the architecture of ResNet50. Similarly, you can compute the number of weight layers in the other ResNet versions.
Now let’s follow the 50-layer architecture in the table above to build the ResNet50 network. We will build a
ResNet50 function that takes the
input_shape and
classes as arguments and outputs the
model.
from keras.layers import Input, Conv2D, BatchNormalization, Activation, MaxPooling2D, AveragePooling2D, Flatten, Dense
from keras.models import Model

def ResNet50(input_shape, classes):
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Stage 1
    X = Conv2D(64, (7, 7), strides=(2, 2), name='conv1')(X_input)
    X = BatchNormalization(axis=3, name='bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = bottleneck_residual_block(X, 3, [64, 64, 256], reduce=True, s=1)
    X = bottleneck_residual_block(X, 3, [64, 64, 256])
    X = bottleneck_residual_block(X, 3, [64, 64, 256])

    # Stage 3
    X = bottleneck_residual_block(X, 3, [128, 128, 512], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])

    # Stage 4
    X = bottleneck_residual_block(X, 3, [256, 256, 1024], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])

    # Stage 5
    X = bottleneck_residual_block(X, 3, [512, 512, 2048], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])

    # AVGPOOL: average pooling over the final 7x7 feature map (assumes a 224x224 input, as in the paper)
    X = AveragePooling2D((7, 7))(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes))(X)

    # Create the model
    model = Model(inputs=X_input, outputs=X, name='ResNet50')

    return model
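For example, to instantiate the network for a 1,000-class problem with 224x224 RGB inputs (the numbers here are just an illustration):
model = ResNet50(input_shape=(224, 224, 3), classes=1000)
model.summary()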
5.5.4 Learning hyperparameters
The authors followed a training procedure similar to AlexNet’s: the training is carried out using mini-batch gradient descent (SGD) with momentum = 0.9. They set the learning rate to start at 0.1 and then decreased it by a factor of 10 when the validation error stopped improving. They also used L2 regularization with a weight decay of 0.0001, which is not implemented in this chapter for simplicity. As you saw in the implementation above, they used batch normalization (BN) right after each convolution and before the activation to speed up training.
import numpy as np
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD
# set the training parameters
epochs = 200
batch_size = 256
# min_lr: lower bound on the learning rate
# factor: factor by which the learning rate will be reduced
reduce_lr= ReduceLROnPlateau(monitor='val_loss',factor=np.sqrt(0.1),patience=5, min_lr=0.5e-6)
# compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.1, momentum=0.9), metrics=['accuracy'])
# train the model
# call the reduce_lr value using callbacks in the training method
model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_test, Y_test), epochs=epochs, callbacks=[reduce_lr])
5.5.5 ResNet performance
Similar to the other networks explained in this chapter, the performance of ResNet models is benchmarked on the ImageNet challenge (ILSVRC). ResNet-152 won first place in the 2015 classification competition with a top-5 error rate of 4.49% for a single model, lowered to 3.57% using an ensemble of models. This is much better than all the other networks, like GoogLeNet (Inception), which achieved a top-5 error rate of 6.67%. ResNet also won first place in many object detection and image localization challenges, as we will see in the object detection chapter of this book. More importantly, the residual block concept in ResNet opened the door to a whole new set of possibilities for efficiently training super deep neural networks with hundreds of layers.
Using open-source implementation
Now that you have learned some of the most popular CNN architectures, I want to share with you some practical advice on how to use them. It turns out that a lot of these neural networks are difficult or finicky to replicate, because a lot of details about tuning the hyperparameters, such as learning rate decay and other things, make some difference to the performance. Deep learning researchers can even have a hard time replicating someone else’s polished work just from reading their paper. Fortunately, a lot of deep learning researchers routinely open source their work on the Internet, such as on GitHub. A simple search on GitHub for the network implementation would point you towards implementations in several deep learning libraries that you can clone and train. Because if you can get the authors’ implementation, you can usually get going much faster than trying to reimplement it from scratch. Although sometimes reimplementing from scratch can be a good exercise, just like what we did earlier.
5.6 Summary and takeaways
In this chapter, we’ve explained the network architectures of five popular CNNs: the classical CNN architectures LeNet, AlexNet, and VGGNet, and the advanced architectures ResNet and Inception (GoogLeNet).