5 Advanced CNN Architectures

“Architecture begins when you place two bricks carefully together. There it begins.”


-- Ludwig Mies van der Rohe

Welcome to part two of this book, Image Classification and Object Detection. Part one laid the foundation of neural network architectures, where we covered Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), or convnets for short. We wrapped up part one with strategies to structure your deep neural network projects and tune their hyperparameters to improve network performance. In part two, we are going to build on this foundation to develop computer vision systems that solve complex image classification and object detection problems. In chapters 3 and 4, we talked about the main components of convnets and the different hyperparameter setups like the number of hidden layers, learning rate, optimizer, etc., in addition to other techniques to improve network performance like regularization, augmentation, dropout, and many more. In this chapter, you will see how all of these come together to build an end-to-end convolutional network. I will walk you through five of the most popular CNNs that were state-of-the-art at their time, and you will see how the authors of these networks thought about building, training, and improving their networks. We will start with LeNet, which was developed in 1998 by Yann LeCun and his colleagues and performed fairly well on handwritten character recognition problems. You will then see how CNN architectures evolved from LeNet to deeper convnets like AlexNet and VGGNet, all the way to more advanced and super deep networks like Inception and ResNet, developed in 2014 and 2015 respectively. For each CNN architecture you will learn the following:

1.     Novel features: in here we will explain the novel features that distinguish these networks from the previous ones and what specific problems the authors were trying to solve

2.     Network architecture: we will cover the architecture and the components of each network and see how they all come together to form the end-to-end network

3.     Network code implementation: we will walk step-by-step through the network implementation using the Keras deep learning library. The goal of this section is for you to learn how to read research papers and implement new architectures as they come up

4.     Set up the learning hyperparameters: after you implement the network architecture, you need to set up the hyperparameters of the learning algorithms that you learned in chapter 4, like the optimizer, learning rate, weight decay, etc. In this chapter, we will implement the learning hyperparameters as presented in the original research paper of each network. In this section, you will see how the performance evolved from one network to another over the years

5.     Network performance: finally, you will see how each network performed on benchmark datasets like MNIST and ImageNet as represented in their research papers

Three main takeaways from reading this chapter:

1.     Understand the architecture and learning hyperparameters of state-of-the-art CNNs. You will be implementing simpler CNNs like AlexNet and VGGNet for simple to medium-complexity problems. For very complex problems you might want to use deeper networks like Inception and ResNet.

2.     Understand the novel features of each network and the motives behind them. Each CNN architecture solves a specific limitation in the previous one. After reading the five networks in this chapter (and their research papers), you will build a strong foundation to read and understand new state-of-the-art networks as they come up.

3.     Learning how convnets have evolved and the authors' thought process helps you build an intuition of what works well and which problems may arise when building your own network.

To get the most out of this chapter, I encourage you to read the research papers that are linked in each section before you read my explanation. What you have learned in part one of this book fully equips you to start reading research papers written by pioneers in the AI field. Reading and implementing research papers is by far one of the most valuable skills that you will build from reading this book. Are you ready? Let’s get started!

5.1   LeNet-5

In 1998, LeCun et al. introduced in their paper “ Gradient-Based Learning Applied to Document Recognition” a pioneering convolutional neural network called LeNet-5. The LeNet-5 architecture is straightforward and you have seen all of its components in the previous chapters of this book. It is composed of 5 weight layers, hence the name LeNet-5: 3 convolutional layers + 2 fully connected layers.

5.1.1   LeNet architecture

The architecture of LeNet-5 looks like this:
Figure 5.1

LeNet architecture in text:

INPUT IMAGE  => C1 => TANH => S2 => C3 => TANH => S4 => C5 => TANH  => FC6 => SOFTMAX7

Where C is the CONV layer, S is the subsampling or POOL layer, and FC is the fully connected layer. The building components of the LeNet architecture are not new to you (they were new back in 1998). You have already learned about the CONV, POOL, and FC layers in chapter 3. Notice that Yann LeCun and his team used tanh as the activation function instead of today's more common ReLU. This is because back in 1998, ReLU had not yet been used in the context of deep learning, and it was more common to use tanh or sigmoid as the activation function in the hidden layers. Without further ado, let’s implement LeNet-5 in Keras.

5.1.2   LeNet-5 implementation in Keras

To implement LeNet-5 in Keras, read the original paper and follow the architecture information from pages 6, 7 and 8. Here are the main takeaways to build the LeNet-5 network:
  • Number of filters in each CONV layer: you can see from the diagram (and defined in the paper) that the depth (number of filters) of each convolutional layer is as follows: C1 = 6, C3 = 16, C5 = 120.
  • Kernel size of each CONV layer: from the paper, the kernel_size is = 5 x 5
  • A subsampling layer (POOL) is added after each convolutional layer. The receptive field of each unit is a 2 x 2 area (i.e. pool_size = 2). Note that the LeNet-5 creators used average pooling, which computes the average value of its inputs, instead of the max pooling layer that we used in our earlier projects, which passes the maximum value of its inputs. You can try both if you are interested to see the difference. For this experiment, we are going to follow the paper architecture.
  • Activation function: as we mentioned before, the creators of LeNet-5 used the tanh activation function for the hidden layers because symmetric functions are believed to yield faster convergence compared to sigmoid functions.
Figure 5.2
Now let’s put that in code to build the LeNet-5 architecture:
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

# Instantiate an empty sequential model
model = Sequential()

# C1 Convolutional Layer
model.add(Conv2D(filters=6, kernel_size=5, strides=1, activation='tanh',
                 input_shape=(32,32,1), padding='same'))

# S2 Pooling Layer
model.add(AveragePooling2D(pool_size=2, strides=2, padding='valid'))

# C3 Convolutional Layer
model.add(Conv2D(filters=16, kernel_size=5, strides=1, activation='tanh',
                 padding='valid'))

# S4 Pooling Layer
model.add(AveragePooling2D(pool_size=2, strides=2, padding='valid'))

# C5 Convolutional Layer
model.add(Conv2D(filters=120, kernel_size=5, strides=1, activation='tanh',
                 padding='valid'))

# Flatten the CNN output to feed it with fully connected layers
model.add(Flatten())

# FC6 Fully Connected Layer
model.add(Dense(units=84, activation='tanh'))

# FC7 Output layer with softmax activation
model.add(Dense(units=10, activation='softmax'))

# print the model summary
model.summary()
Figure 5.3
LeNet-5 is a small neural network by today’s standards. It has 61,706 parameters, compared to the millions of parameters in the more modern networks you will see later in this chapter.

5.1.3   Set up the learning hyperparameters

The authors used a scheduled learning rate decay, where the learning rate decreases according to the following schedule: 0.0005 for the first two epochs, 0.0002 for the next three epochs, 0.00005 for the next four, then 0.00001 thereafter. In their paper, the authors trained their network for 20 epochs. Let’s build an lr_schedule function with the above schedule. The function takes an integer epoch number as an argument and returns the learning rate (lr).
def lr_schedule(epoch):
    # initiate the learning rate with value = 0.0005
    lr = 5e-4

    # lr = 0.0005 for the first two epochs, 0.0002 for the next three epochs,
    # 0.00005 for the next four, then 0.00001 thereafter.
    # check the later epochs first so that each range gets the right value
    if epoch > 9:
        lr = 1e-5
    elif epoch > 5:
        lr = 5e-5
    elif epoch > 2:
        lr = 2e-4
    return lr
We will then use the lr_schedule function in the code snippet below to compile the model:
from keras.optimizers import SGD

model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=lr_schedule(0)),
              metrics=['accuracy'])
Now start the network training for 20 epochs as mentioned in the paper:
hist = model.fit(X_train, y_train, batch_size=32, epochs=20,
           validation_data=(X_test, y_test), callbacks=[checkpointer],
           verbose=2, shuffle=True)
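Note that the fit call above references a checkpointer callback that is defined in the notebook attached to this chapter, and that passing lr_schedule(0) to SGD only sets the initial learning rate. A minimal sketch of both callbacks might look like the following (the weights file name is an assumption); to make the scheduled decay actually take effect, pass lr_scheduler to model.fit along with checkpointer:
from keras.callbacks import ModelCheckpoint, LearningRateScheduler

# save the model weights whenever the validation loss improves
checkpointer = ModelCheckpoint(filepath='lenet.weights.best.hdf5', verbose=1,
                               save_best_only=True)

# call lr_schedule at the start of every epoch so the scheduled decay is applied
lr_scheduler = LearningRateScheduler(lr_schedule)

# e.g. callbacks=[checkpointer, lr_scheduler] in the model.fit call above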

5.1.4   LeNet performance on MNIST dataset

When you train LeNet-5 on the MNIST dataset, you will get above 99% accuracy (see the code notebook attached to this chapter at www.computervisionbook.com). Try re-running this experiment with the 'relu' activation function in the hidden layers and observe the difference in network performance.

5.2   AlexNet

We saw how LeNet performed very well on the MNIST dataset. But it turns out that MNIST is a very simple dataset: it contains grayscale images (one channel) classified into only 10 classes, which makes it a simpler challenge. The main motivation behind AlexNet was to build a deeper network that can learn more complex functions. AlexNet won the ILSVRC image classification competition in 2012. Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever created a neural network architecture called 'AlexNet' in their paper "ImageNet Classification with Deep Convolutional Neural Networks". They trained their network on 1.2 million high-resolution images of the ImageNet dataset, spread across 1,000 different classes. AlexNet was state-of-the-art at its time because it was the first real "deep" network (back then) that opened the door for the computer vision community to seriously consider convolutional networks in their applications. We will explain deeper networks later in this chapter, like VGGNet and ResNet, but it is good to see how convnets evolved and the main drawbacks of AlexNet that were the main motivation for the later networks. The AlexNet architecture is shown in the figure below:
Figure 5.4
As you see in the diagram above, AlexNet has a lot of similarities to LeNet but it is much deeper (more hidden layers) and bigger (more filters per layer). They both have similar building blocks of a series of CONV + POOL layers stacked on top of each other followed by FC layers and a Softmax. We’ve seen that LeNet has around 61 thousand parameters whereas AlexNet has about 60 million parameters and 650,000 neurons which gives it a larger learning capacity to understand more complex features. This allowed AlexNet to have a remarkable performance in the ILSVRC image classification competition in 2012.

5.2.1   AlexNet architecture

You’ve seen a version of the AlexNet architecture in the project at the end of chapter 3. The architecture is pretty straightforward. It consists of:
  • Convolutional layers with the following kernel sizes: 11x11, 5x5, and 3x3
  • Max pooling layers for image downsampling
  • Dropout layers to avoid overfitting
  • Unlike LeNet, the AlexNet authors used ReLU activation functions in the hidden layers and a softmax activation in the output layer
AlexNet consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. You can represent the AlexNet architecture in text as follows:

INPUT IMAGE  => CONV1 => POOL2 => CONV3 => POOL4 => CONV5 => CONV6 => CONV7 => POOL8 => FC9 => FC10 => SOFTMAX7

5.2.2   Novel features of AlexNet

Before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas. But AlexNet was the milestone that convinced a lot of the computer vision community to take a serious look at deep learning and demonstrated that deep learning really works in computer vision. Compared to previous CNNs (like LeNet), AlexNet presented some novel features that were not used in previous architectures. You are already familiar with all of them from the previous chapters in this book, so it should be quick for us to go through them here.

5.2.2.1 ReLU activation function:

AlexNet, proposed by Alex Krizhevsky, uses ReLU (Rectified Linear Unit) for the non-linear part instead of the tanh or sigmoid functions that were the earlier standard for traditional neural networks (like LeNet). ReLU was used in the hidden layers of the AlexNet architecture because it trains much faster. This is because the derivative of the sigmoid function becomes very small in the saturating region, and therefore the updates applied to the weights almost vanish. This phenomenon is called the vanishing gradient problem. ReLU is represented by the equation f(x) = max(0,x) and is discussed in detail in chapter 2.
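To make the vanishing gradient intuition concrete, here is a small NumPy check (not from the AlexNet paper) comparing the two derivatives at a strongly positive input: the sigmoid gradient almost vanishes in its saturating region, while the ReLU gradient stays at 1:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 10.0
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))   # derivative of the sigmoid at x
relu_grad = 1.0 if x > 0 else 0.0              # derivative of relu = max(0, x)

print(sigmoid_grad)   # ~4.5e-05, almost vanished
print(relu_grad)      # 1.0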

5.2.2.2 Dropout layer:

as explained in chapter 3, dropout layers are used to prevent the neural network from overfitting. The neurons that are “dropped out” do not contribute to the forward pass and do not participate in backpropagation. This means that every time an input is presented, the neural network samples a different architecture, but all these architectures share the same weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. The authors used dropout with a probability of 0.5 in the two fully-connected layers.
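In Keras, this is simply a Dropout layer placed after a fully-connected layer; a minimal sketch of the pattern used for AlexNet's two FC layers (repeated in the full implementation later in this section) looks like this:
from keras.layers import Dense, Dropout

# randomly drop 50% of the units of the preceding FC layer during training
model.add(Dense(units=4096, activation='relu'))
model.add(Dropout(0.5))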

5.2.2.3. Data augmentation:

one popular and very effective approach to avoid overfitting is to artificially enlarge the dataset using label-preserving transformations. This happens by generating new instances of the training images with transformations like image rotation, flipping, scaling, and many more. Data augmentation is explained in detail in chapter 4.
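As a quick illustration (not part of the AlexNet implementation in this chapter), Keras provides an ImageDataGenerator that applies this kind of label-preserving transformation on the fly; the specific parameter values below are only example settings:
from keras.preprocessing.image import ImageDataGenerator

# randomly rotate, shift, and flip the training images while keeping their labels
datagen = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                             height_shift_range=0.1, horizontal_flip=True)

# train on the augmented batches instead of the raw X_train array, for example:
# model.fit_generator(datagen.flow(X_train, y_train, batch_size=128), epochs=90)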

5.2.2.4. Local response normalization:

in AlexNet, local response normalization is used. It is different from the batch normalization technique explained in chapter 4. Normalization helps speed up convergence. Nowadays, batch normalization (BN) is used instead of local response normalization, and we will be using BN in our implementation in this chapter.
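As a preview of the implementation below, swapping local response normalization for batch normalization simply means adding a BatchNormalization layer after each convolutional block:
from keras.layers import Conv2D, Activation, BatchNormalization

# conv + relu followed by batch normalization, the pattern used throughout AlexNet below
model.add(Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), padding='valid'))
model.add(Activation('relu'))
model.add(BatchNormalization())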

5.2.2.5. Weight regularization:

the authors used a weight decay of 0.0005. Weight decay is another term for the L2 regularization technique explained in chapter 4. It is an approach to reduce the overfitting of deep learning models on the training data and allow them to generalize better on new data.

from keras.regularizers import l2

# the value passed to l2() is the weight decay (lambda), e.g. 0.0005 as used by the authors
model.add(Conv2D(32, (3,3), kernel_regularizer=l2(0.0005)))

The lambda value is the weight decay hyperparameter that you can tune. If you still see overfitting, increase the lambda value to reduce it. In this case, the authors found that a small decay value of 0.0005 was good enough for the model to learn.

5.2.2.6. Training on multiple GPUs:

the authors used a GTX 580 GPU that has only 3GB of memory. It was state-of-the-art at the time but not large enough to train the network on the 1.2 million training examples in their dataset. Therefore they developed a complicated way to spread their network across two GPUs. The basic idea was that a lot of the layers were split across two different GPUs, and there was a thoughtful scheme for when the two GPUs would communicate with each other. You don’t need to worry about these details nowadays because there are far more advanced ways to train your deep networks on distributed GPUs, which we will discuss later in this book.
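Just as a hedged illustration of what a modern setup can look like (this is not what the AlexNet authors did, and distributed training is covered properly later in the book), TensorFlow 2 can mirror a Keras model across all visible GPUs with a few lines; build_model here is a placeholder for your own model-building code:
import tensorflow as tf

# replicate the model on all visible GPUs and split each batch between them
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = build_model()   # placeholder for the code that builds and returns the model
    model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=90, batch_size=128)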

5.2.3   AlexNet implementation in Keras

Okay, now that you’ve learned the basic components of AlexNet and its novel features, let’s apply all of these together to build the AlexNet neural network. I suggest that you read the architecture description on page 4 of the original paper and follow along with the next section. As depicted in the figure below, the network contains eight weight layers: the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels. The AlexNet input starts with 227x227x3 images. If you read the paper, you will notice that it refers to an input volume of dimensions 224x224x3, but the numbers make sense only for 227x227x3 images. I suspect this could be a typo in the paper.
Figure 5.6
The layers are stacked together as follows:
  • CONV1 - the authors used a large kernel size = 11. They also used a large stride = 4, which makes the input dimensions roughly shrink by a factor of 4 (from 227x227 to 55x55).
    Calculate the dimensions of the output as follows:
    (227 - 11)/4 + 1 = 55 and the depth is the number of filters in the conv layer = 96
    The output dimensions = 55x55x96
  • POOL layer with a filter size of 3x3, which reduces the dimensions from 55x55 to 27x27.
    (55 - 3)/2 + 1 = 27. The pooling layer doesn’t change the depth of the volume.
    The output dimensions = 27x27x96
    Similarly, you can calculate the output dimensions of the remaining layers.
  • CONV2 - kernel size = 5, depth = 256, and stride = 1
  • POOL layer with size = 3x3, which downsamples its input dimensions from 27x27 to 13x13.
  • CONV3 - kernel size = 3, depth = 384, and stride = 1
  • CONV4 - kernel size = 3, depth = 384, and stride = 1
  • CONV5 - kernel size = 3, depth = 256, and stride = 1
  • POOL layer with size = 3x3, which downsamples its input from 13x13 to 6x6
  • Flatten layer to flatten the dimension volume 6x6x256 to 1x9216
  • FC layer with 4096 neurons
  • FC layer with 4096 neurons
  • Softmax layer with 1000 neurons
Note that all CONV layers are followed by a batch normalization layer and all hidden layers are followed by ReLU activations. Now, let’s put that in code to build the AlexNet architecture:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, BatchNormalization, Activation, Flatten, Dense, Dropout
from keras.regularizers import l2

# Instantiate an empty sequential model
model = Sequential()

# 1st layer (conv + pool + batchnorm)
model.add(Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), padding='valid',
                 input_shape=(227,227,3)))
# the activation function can be added as its own layer or within the Conv2D
# function as we did in previous implementations
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
model.add(BatchNormalization())

# 2nd layer (conv + pool + batchnorm)
model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
model.add(BatchNormalization())

# layer 3 (conv + batchnorm)      <--- note that the authors did not add a POOL layer here
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# layer 4 (conv + batchnorm)      <--- similar to layer 3
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# layer 5 (conv + batchnorm)
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))

# Flatten the CNN output to feed it with fully connected layers
model.add(Flatten())

# layer 6 (Dense layer + dropout)
model.add(Dense(units=4096, activation='relu'))
model.add(Dropout(0.5))

# layer 7 (Dense layer + dropout)
model.add(Dense(units=4096, activation='relu'))
model.add(Dropout(0.5))

# layer 8 (softmax output layer)
model.add(Dense(units=1000, activation='softmax'))

# print the model summary
model.summary()

Model summary

When you print the model summary you will see the number of total parameters = 62 million as follows:
Figure 5.7

5.2.4   Set up the learning hyperparameters

AlexNet was trained for 90 epochs, which took six days of simultaneous training on two Nvidia GeForce GTX 580 GPUs. This is why you will see that their network is split into two pipelines in the original paper. The authors started with an initial learning rate of 0.01 with a momentum of 0.9. The learning rate is then divided by 10 when the validation error stops improving.
import numpy as np
from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau

# reduce the learning rate when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1))

# set the SGD optimizer with lr of 0.01 and momentum of 0.9
optimizer = SGD(lr=0.01, momentum=0.9)
  
 # compile the model
 model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
  
 # train the model
 # call the reduce_lr value using callbacks in the training method
 model.fit(X_train, y_train, batch_size=128, epochs=90, validation_data=(X_test, y_test),
           verbose=2, callbacks=[reduce_lr])

5.2.5   AlexNet performance on ImageNet dataset

AlexNet significantly outperformed all the prior competitors in the ILSVRC challenge and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry of that year, which used traditional classifiers. This huge improvement in performance attracted the computer vision community’s attention to the potential that convolutional networks have to solve complex vision problems and led to more advanced CNN architectures, as you will see in the following sections of this chapter.

5.3   VGGNet

VGGNet was developed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at Oxford University, hence the name VGG. It was introduced in their paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The building components of VGGNet are exactly the same as LeNet and AlexNet, except that it is an even deeper network with more convolutional, pooling, and dense layers. Other than that, there are no new components introduced here. VGGNet, also known as VGG-16, consists of 16 weight layers: 13 convolutional layers + 3 fully-connected layers. Its uniform architecture makes it very appealing in the deep learning community because it is very easy to understand.

5.3.1   Novel features of VGGNet

We’ve seen how challenging it can be to set up the CNN hyperparameters like kernel size, padding, strides, etc. VGGNet’s novel concept is a simple architecture that contains uniform components (CONV and POOL layers). It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3x3 kernel-sized filters one after another. The architecture is composed of a series of uniform CONV building blocks followed by a unified POOL layer, where:
  • all convolutional layers use 3x3 kernel-sized filters with a stride = 1 and padding = same
  • all pooling layers are 2x2 pool-size with a stride = 2

Why use smaller 3x3 convolutions?

The authors of VGGNet decided to use smaller 3x3 kernels to allow the network to extract finer-level features of the image, compared to AlexNet’s large 11x11 and 5x5 kernels. The idea is that, for a given convolutional receptive field, multiple stacked smaller kernels are better than one larger kernel, because multiple non-linear layers increase the depth of the network, which enables it to learn more complex features at a lower cost since it has a lower number of learning parameters. For example, in their experiments the authors noticed that a stack of two 3x3 conv. layers (without spatial pooling in between) has an effective receptive field of 5x5, and three 3x3 conv. layers have the effect of a 7x7 receptive field. So by using 3x3 convolutions with higher depth, you get the benefit of more nonlinear rectification layers (relu), which makes the decision function more discriminative. Second, this decreases the number of training parameters: a stack of three 3x3 conv. layers with C channels is parameterised by 3 × (3² × C²) = 27C² weights, compared to a single 7x7 conv. layer, which requires 7² × C² = 49C² weights, which is 81% more parameters. This unified configuration of the CONV and POOL components simplifies the neural network architecture, which makes it very easy to understand and implement. The VGGNet architecture is developed by stacking 3x3 convolutional layers, with 2x2 pooling layers inserted after several CONV layers. This is followed by the traditional classifier that is composed of some fully-connected layers and a softmax, as depicted in the figure below:
Figure 5.9
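To sanity-check the parameter arithmetic above, you can build both options in Keras and compare their parameter counts; this is a throwaway experiment, not part of the VGGNet implementation, and with C = 64 channels and biases ignored it prints 27C² = 110,592 versus 49C² = 200,704:
from keras.models import Sequential
from keras.layers import Conv2D

C = 64  # number of input and output channels

# option 1: three stacked 3x3 convolutions
stacked = Sequential([
    Conv2D(C, (3,3), padding='same', use_bias=False, input_shape=(224,224,C)),
    Conv2D(C, (3,3), padding='same', use_bias=False),
    Conv2D(C, (3,3), padding='same', use_bias=False)
])

# option 2: a single 7x7 convolution with the same effective receptive field
single = Sequential([
    Conv2D(C, (7,7), padding='same', use_bias=False, input_shape=(224,224,C))
])

print(stacked.count_params())   # 27 * C^2 = 110,592
print(single.count_params())    # 49 * C^2 = 200,704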

5.3.2   VGGNet Configurations

The authors created several configurations for the VGGNet architecture, as you see in the table below. All configurations follow the same generic design. Configurations D and E are the most commonly used and are referred to as VGG-16 and VGG-19, referring to the number of weight layers. Each block contains a series of 3x3 convolutional layers with the same hyperparameter configuration, followed by a 2x2 pooling layer.
Figure 5.10
In the table below you will see the number of learning parameters in millions for each configuration. VGG-16 yields ~138 million parameters and VGG-19 is a deeper version of VGGNet that has more than 144 million parameters. VGG-16 is more commonly used because it performs almost as well as VGG-19 but with fewer parameters.
Figure 5.11

5.3.3   VGG-16 in Keras

Configurations D (VGG-16) and E (VGG-19) are the most commonly used configurations because they are deeper networks that can learn more complex functions. So, in this chapter we will implement configuration D of the VGGNet that has 16 weight layers. VGG-19 (Configuration E) can be similarly implemented by just adding a fourth CONV layer to the third, fourth, and fifth blocks as you can see in the above table. You can see the notebooks attached to this chapter for a full implementation of both VGG-16 and VGG-19 at www.computervisionbook.com. Note that the authors used the following regularization techniques to avoid overfitting:
  • L2 regularization with a weight decay of 5×10^-4. This is not added to the implementation below for simplicity
  • Dropout regularization for the first two fully-connected layers, with the dropout ratio set to 0.5
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout

# Instantiate an empty sequential model
model = Sequential()
  
 # block #1
 model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same', input_shape=(224,224, 3)))
 model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(MaxPool2D((2,2), strides=(2,2)))
  
 # block #2
 model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(MaxPool2D((2,2), strides=(2,2)))
  
 # block #3
 model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(MaxPool2D((2,2), strides=(2,2)))
  
 # block #4
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(MaxPool2D((2,2), strides=(2,2)))
  
 # block #5
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
 model.add(MaxPool2D((2,2), strides=(2,2)))
  
 # block #6 (classifier)
 model.add(Flatten())
 model.add(Dense(4096, activation='relu'))
 model.add(Dropout(0.5))
 model.add(Dense(4096, activation='relu'))
 model.add(Dropout(0.5))
 model.add(Dense(1000, activation='softmax'))
  
 # print the model summary
 model.summary()

Model summary

When you print the model summary you will see the number of total parameters ~ 138 million as follows:
Figure 5.12

5.3.4   Learning hyperparameters

The authors followed a training procedure similar to AlexNet’s. Namely, the training is carried out using mini-batch gradient descent (SGD) with momentum = 0.9. The learning rate was initially set to 0.01, and then decreased by a factor of 10 when the validation set accuracy stopped improving.
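The paper does not spell this out as code, but a minimal Keras sketch of that setup, mirroring what we did for AlexNet, could look like the following (the patience value and the monitored metric name are assumptions):
from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau

# SGD with momentum 0.9 and an initial learning rate of 0.01
optimizer = SGD(lr=0.01, momentum=0.9)

# divide the learning rate by 10 when the validation accuracy stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=5)

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
# pass reduce_lr through the callbacks argument of model.fit, e.g. callbacks=[reduce_lr]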

5.3.5   VGGNet performance on ImageNet dataset

VGG-16 achieved a top-5 error rate of 8.1% on ImageNet, compared to 15.3% achieved by AlexNet. VGG-19 did even better, achieving a top-5 error rate of ~7.4%. It is worth noting that, in spite of the larger number of parameters and the greater depth of VGGNet compared to AlexNet, VGGNet required fewer epochs to converge due to the implicit regularisation imposed by greater depth and smaller convolutional filter sizes.

5.4   Inception and GoogLeNet

The Inception network came to the world in 2014 when a group of researchers at Google published their paper "Going Deeper with Convolutions". The main hallmark of this architecture is building a deeper neural network while improving the utilization of the computing resources inside the network. One particular incarnation of the Inception network is called GoogLeNet and was used in the team's submission for ILSVRC14. It is a 22-layer deep network, deeper than VGGNet, while using 12 times fewer parameters (from ~138 million to ~13 million) and achieving significantly more accurate results. The network used a CNN inspired by the classical networks (AlexNet and VGGNet) but implemented a novel element dubbed the Inception module.

5.4.1   Novel features of Inception

The authors of the Inception network took a different approach when designing their network architecture. As we’ve seen in the previous networks, there are some architectural decisions that you need to make for each layer when you are designing your network. Decisions like:
  • What should the kernel size of the convolutional layer be? We've seen in previous architectures that the kernel size varies between 1x1, 3x3, 5x5, and in some cases 11x11 like in AlexNet. When designing the CONV layer, we find ourselves trying to pick and tune the kernel size of each layer so that it fits our dataset. As you recall from chapter 3, smaller kernels capture finer details of the image, whereas bigger filters will leave out the minute details.
  • When to use the pooling layer? AlexNet uses pooling layers every 1 or 2 convolutional layers to downsize the spatial features. VGGNet applies pooling after every 2, 3, or 4 CONV layers as the network gets deeper.
Configuring the kernel size and positioning the pool layers are decisions that you need to make mostly by trial and error, experimenting to get the optimal results. Inception says: instead of choosing a desired filter size in a CONV layer and deciding where to place the pooling layers, let's apply them all together in one block and call it the "Inception module". Instead of stacking layers on top of each other like in classical architectures, the authors suggest that we create an "inception module" that consists of several convolutional layers with different kernel sizes. The architecture is then developed by stacking the inception modules on top of each other. Let’s take a look at how classical convolutional networks are architected vs. the Inception network:
Figure 5.13
From the above diagram, you can observe the following:
  • In classical architectures like LeNet, AlexNet, and VGGNet, we stack convolutional and pooling layers on top of each other to build the feature extractors. At the end, we add the dense FC layers to build the classifier.
  • In the Inception architecture, we start with conv + pool layers, then we stack the inception modules and pooling layers to build the feature extractors, and then add the regular dense classifier layers.
We’ve been treating the inception modules as black boxes to understand the bigger picture of the inception architecture. Now, we will unpack the inception module to understand how it works.

5.4.2   Inception module - naive version

The Inception module is a combination of four layers:
  1. 1×1 Convolutional layer,
  2. 3×3 Convolutional layer,
  3. 5×5 Convolutional layer, and
  4. 3x3 max-pooling layer
The outputs of these layers are then concatenated into a single output volume forming the input of the next stage. The naive representation of the inception module is represented in the figure below:
Figure 5.14
The diagram might look a little overwhelming but the idea is simple to understand. Let’s follow along with this example:

1.     Suppose we have an input dimensional volume from the previous layer of size = 32x32x200

2.     We then feed this input to the four layers simultaneously:

  • a. 1x1 conv with depth = 64 and same padding. The output of this kernel = 32x32x64
  • b. 3x3 conv with depth = 128 and same padding. Output = 32x32x128
  • c. 5x5 conv with depth = 32 and same padding. Output = 32x32x32
  • d. 3x3 max-pooling layer with same padding and stride = 1. Output = 32x32x32

3.     Then, we concatenate the depth of the 4 outputs to create one output volume of dimensions = 32x32x256

Now we have an inception module that takes an input volume of 32x32x200 and outputs a volume of 32x32x256.

5.4.3   Inception module with dimensionality reduction

The naive representation of the inception module that we just saw has a big computational cost problem that comes with processing larger filters like the 5x5 convolutional layer. To get a better sense of the compute problem with the naive representation, let’s calculate the number of operations that will be performed for the 5x5 CONV layer in the previous example. The input volume with dimensions of 32x32x200 will be fed to the 5x5 conv of 32 filters with dimensions = 5x5x32. This means that the total number of multiplications the computer needs to perform is 32x32x200 multiplied by 5x5x32, which is more than 163 million operations. While modern computers can perform this many operations, it is still pretty expensive. This is when the dimensionality reduction layers can be very useful.

Dimensionality reduction layer (1x1 convolutional layers):

The 1x1 convolutional layer can reduce the operational cost of 163 million operations to about a tenth of that. That is why it is given the name "reduce layer". The idea here is to add a 1x1 CONV layer before the bigger kernels like the 3x3 and 5x5 CONV layers to reduce their depth, which in turn will reduce the number of operations. Let’s look at the example below. Suppose we have an input dimension volume of 32 x 32 x 200. We then add a 1x1 CONV with depth = 16. This reduces the dimension volume from 200 to 16 channels. We can then apply the 5x5 CONV to the output, which has much less depth.
Figure 5.15
Notice that the input of 32x32x200 is processed through the two conv layers and outputs a volume of dimensions 32x32x32, which is the same dimension that we produced before without applying the dimensionality reduction layer. But what we've done here is, instead of processing the 5x5 conv layer on the entire 200 channels of the input volume, we take this huge volume and shrink its representation to a much smaller intermediate volume that only has 16 channels. Now, let’s look at the computational cost involved in this operation and compare it to the 163 million multiplications that we got before applying the reduce layer. Computation = operations in the 1x1 convolution + operations in the 5x5 convolution

= 32x32x200 multiplied by 1x1x16 + 32x32x16 multiplied by 5x5x32

= 3.2 million + 13.1 million

The total number of multiplications in this operation = 16.3 million, which is a tenth of the 163 million multiplications that we calculated earlier without the reduce layers. What is the impact of dimensionality reduction on the network performance? Now you might be wondering, does shrinking down the representation size so dramatically hurt the performance of the neural network? The authors ran experiments and found that as long as you implement this reduce layer with moderation, you can shrink down the representation size significantly without hurting the performance, and it saves a lot of computation. Now let’s put the reduce layers in action and build the new inception module with dimensionality reduction. To do that, we will keep the same concept of concatenating all four layers that we had from the naive representation. We will add a 1x1 convolutional reduce layer before the 3x3 and 5x5 convolutions to reduce their computational cost. We will also add a 1x1 conv after the 3x3 max-pooling layer because pooling layers don’t reduce the depth of their inputs, so we will need to apply the reduce layer to their output before we do the concatenation. See the diagram below:
Figure 5.18
We add dimensionality reduction prior to bigger convolutional layers to allow for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.

5.4.4   Inception architecture

Okay, now that we understand the components of the inception module, we are ready to build the Inception network architecture. Here, we are going to use the dimension reduction representation of the inception module, simply stack these modules on top of each other, and add a 3x3 pooling layer in between for downsampling, as you can see in the figure below.
Figure 5.19
In the figure above, we stacked two inception modules with a pooling layer in between. We can stack as many inception modules as we want to build a very deep convolutional network. In the original paper, the authors built a specific incarnation of the inception module and called it GoogLeNet. They used this network in their submission for the ILSVRC 2014 competition. The GoogLeNet architecture is depicted in the diagram below:
Figure 5.20   Full GoogLeNet Model
As you can see in the diagram above, GoogLeNet uses a stack of a total of 9 inception blocks with a max pooling layer after every few blocks to reduce the dimensionality. To simplify this implementation, we are going to break down the GoogLeNet architecture into three parts A, B, and C:
  • Part A: identical to the AlexNet and LeNet architectures, where it contains a series of CONV and POOL layers
  • Part B: contains nine inception modules stacked as follows - 2 inception modules + pooling layer + 5 inception modules + pooling layer + 2 inception modules
  • Part C: the classifier part of the network, consisting of the fully connected and softmax layers

5.4.5   GoogLeNet in Keras

Now, let’s implement the GoogLeNet architecture in Keras. First we build the inception module function so we can use it in our architecture:
Figure 5.21
Notice that the inception module takes the features from the previous module as an input, passes them through four routes, concatenates the depth of the outputs of all four routes, and then passes the concatenated output to the next module. The four routes are as follows:
  1. 1x1 conv
  2. 1x1 conv + 3x3 conv
  3. 1x1 conv + 5x5 conv
  4. 3x3 pool + 1x1 conv
Now, let’s build the inception_module function. The function takes the number of filters of each convolutional layer as an argument and returns the concatenated output.
from keras.layers import Conv2D, MaxPool2D, concatenate
from keras import initializers

# weight initializers used across the network; these particular values are an assumption
# here -- the chapter notebook defines them once before building the model
kernel_init = initializers.glorot_uniform()
bias_init = initializers.Constant(value=0.2)

def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3, filters_5x5_reduce,
                     filters_5x5, filters_pool_proj, name=None):

    # 1x1 route: a 1x1 convolution that takes its input directly from the previous layer
    conv_1x1 = Conv2D(filters_1x1, kernel_size=(1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)

    # 3x3 route = 1x1 conv (reduce layer) + 3x3 conv
    pre_conv_3x3 = Conv2D(filters_3x3_reduce, kernel_size=(1, 1), padding='same', activation='relu',
                          kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, kernel_size=(3, 3), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(pre_conv_3x3)

    # 5x5 route = 1x1 conv (reduce layer) + 5x5 conv
    pre_conv_5x5 = Conv2D(filters_5x5_reduce, kernel_size=(1, 1), padding='same', activation='relu',
                          kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, kernel_size=(5, 5), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(pre_conv_5x5)

    # pool route = pool layer + 1x1 conv (projection layer)
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu',
                       kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)

    # concatenate the depth of the four routes together
    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)

    return output

GoogLeNet architecture:

Now that we have the inception_module function ready, let’s build the GoogLeNet architecture that we explained in the previous diagram. To get the values of the inception_module arguments, we are going to go through the table below, which represents the hyperparameter setup as implemented by the authors in the original paper "Going Deeper with Convolutions".
Figure 5.22
Note that “#3×3 reduce” and “#5×5 reduce” in the table represent the 1×1 filters in the reduction layers that are used before the 3×3 and 5×5 convolutions. Now, let’s go through the implementations of parts A, B, and C.

Part A: build the bottom part of the network

Let’s build the bottom part of the network. This part consists of: 7x7 CONV > 3x3 POOL > 1x1 CONV > 3x3 CONV > 3x3 POOL as you can see in the diagram below.
Figure 5.23
LocalResponseNorm layer: similar to AlexNet, a local response normalization is used. It is a normalization technique that helps speed up the convergence. Nowadays, batch normalization (BN) is used instead of using local response normalization and we will be using BN in our implementation in this chapter.
from keras.layers import Input, BatchNormalization

# input layer with size = 224x224x3
input_layer = Input(shape=(224, 224, 3))
  
 x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7/2', kernel_initializer=kernel_init, bias_initializer=bias_init)(input_layer)
  
 x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
  
 x = BatchNormalization()(x)
  
 x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu')(x)
 x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu')(x)
  
 x = BatchNormalization()(x)
  
 x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)

Part B:

Build inception modules 3a and 3b + the max pooling layer. Their hyperparameters are:

type             #1x1   #3x3 reduce   #3x3   #5x5 reduce   #5x5   Pool proj
inception (3a)   64     96            128    16            32     32
inception (3b)   128    128           192    32            96     64

x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128, filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32, name='inception_3a')
  
 x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192, filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64, name='inception_3b')
  
 x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)

Similarly, let’s create modules 4a, 4b, 4c, 4d, 4e, and the max pooling layer:

x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208, filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64, name='inception_4a')
  
 x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4b')
  
 x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4c')
  
 x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288, filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64, name='inception_4d')
  
 x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_4e')
  
 x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)
Now, let’s create modules 5a and 5b:
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_5a')
  
 x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384, filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128, name='inception_5b')

Part C: the classifier part

In their experiments, the authors found that adding a 7x7 average pooling layer improved the top-1 accuracy by about 0.6%. They then added a dropout layer with a 40% drop probability to reduce overfitting.
from keras.layers import AveragePooling2D, Flatten, Dropout, Dense

x = AveragePooling2D(pool_size=(7,7), strides=1, padding='valid')(x)
x = Flatten()(x)   # flatten the pooled volume so it can feed the Dense layer
x = Dropout(0.4)(x)
x = Dense(10, activation='softmax', name='output')(x)
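The training code in the next section compiles the model with three losses, which assumes the two auxiliary classifiers from the original paper (small softmax heads branching off inception modules 4a and 4d) are also built and that the model is created with three outputs. A minimal sketch of how that could look is below; the helper name aux_classifier and the variables aux1_input and aux2_input (the saved outputs of inception_4a and inception_4d) are assumptions for illustration, and the notebook attached to this chapter contains the full version:
from keras.models import Model
from keras.layers import AveragePooling2D, Conv2D, Flatten, Dense, Dropout

def aux_classifier(inputs, name):
    # the auxiliary head described in the paper: 5x5 average pooling with stride 3,
    # a 1x1 conv, an FC layer, heavy dropout, and a softmax output
    aux = AveragePooling2D((5, 5), strides=3)(inputs)
    aux = Conv2D(128, (1, 1), padding='same', activation='relu')(aux)
    aux = Flatten()(aux)
    aux = Dense(1024, activation='relu')(aux)
    aux = Dropout(0.7)(aux)
    return Dense(10, activation='softmax', name=name)(aux)

# aux1_input and aux2_input are assumed to hold the outputs of inception_4a and
# inception_4d, saved while building Part B above
x1 = aux_classifier(aux1_input, 'aux_output_1')
x2 = aux_classifier(aux2_input, 'aux_output_2')

# the full model has one main output (x) and the two auxiliary outputs
model = Model(input_layer, [x, x1, x2], name='googlenet')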

5.4.6   Learning hyperparameters

The authors used stochastic gradient descent optimizer with 0.9 momentum. They also implemented a fixed learning rate decay schedule of 4% every 8 epochs.
import math
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

epochs = 25
initial_lrate = 0.01
  
 # implement the learning rate decay function
 def decay(epoch, steps=100):
     initial_lrate = 0.01
     drop = 0.96
     epochs_drop = 8
     lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
     return lrate
  
 lr_schedule = LearningRateScheduler(decay, verbose=1)
  
 sgd = SGD(lr=initial_lrate, momentum=0.9, nesterov=False)
  
 model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy', 'categorical_crossentropy'], loss_weights=[1, 0.3, 0.3], optimizer=sgd, metrics=['accuracy'])
  
 model.fit(X_train, [y_train, y_train, y_train], validation_data=(X_test, [y_test, y_test, y_test]), epochs=epochs, batch_size=256, callbacks=[lr_schedule])

5.4.7   Inception performance on ImageNet dataset

GoogleNet is the winner of the ILSVRC 2014 competition. It achieved a top-5 error rate of 6.67% which was very close to human level performance and much better than the previous CNNs like AlexNet and VGGNet.

5.5   ResNet

Residual Neural Network (ResNet) was developed in 2015 by Kaiming He et al. from the Microsoft Research team in their paper "Deep Residual Learning for Image Recognition". They introduced a novel architecture with "skip connections", called the residual module. The network also features heavy batch normalization for the hidden layers. This technique allowed the authors to train very deep neural networks with 50, 101, and 152 weight layers while still having lower complexity than smaller networks like VGGNet (19 layers). ResNet was able to achieve a top-5 error rate of 3.57% in ILSVRC15, which beat the performance of all prior convnets.

5.5.1   Novel features of ResNet

Looking at how neural network architectures evolved from LeNet, AlexNet, VGGNet, and Inception, you might have noticed that the deeper the network, the larger its learning capacity and the better it extracts features from images. This mainly happens because very deep networks are able to represent very complex functions, which allows the network to learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). Earlier in this chapter, we saw deep neural networks like VGG-19, which contains 19 layers, and GoogLeNet, which contains 22 layers. Both performed very well in the ImageNet challenge. But can we build even deeper networks? We learned from chapter 4 that one downside of adding too many layers is that it makes the network more prone to overfit the training data. This is not a big problem, because there are many regularization techniques that we learned in chapter 4 that we can use to avoid overfitting, like dropout, L2 regularization, and batch normalization. So, if we take care of the overfitting problem, wouldn’t we want to build very deep networks that are 50, 100, or even 150 layers deep? The answer is yes. We definitely should try to build very deep neural networks. There is only one other problem that we need to fix to unlock the ability to build super deep networks: a phenomenon called vanishing gradients. To solve the vanishing gradient problem, the authors created a shortcut that allows the gradient to be directly backpropagated to earlier layers. These shortcuts are called "skip connections". The skip connections are used to flow information from earlier layers in the network to later layers, creating an alternate shortcut path for the gradient to flow through. Another important benefit of the skip connections is that they allow the model to learn an identity function, which ensures that the layer will perform at least as well as the previous layer.
Figure 5.24
The figure on the left shows the traditional stacking of convolution layers, one after the other. On the right, we still stack convolution layers as before, but we now also add the original input to the output of the convolution block. This is called a skip connection. We then add both signals: the skip connection + the main path. Note that the shortcut arrow points to the end of the second convolutional layer, not after it. The reason is that you do the addition of both paths before you apply the activation function 'relu' of this layer. It goes like this:
Figure 5.25
As you can see in the figure above, the X signal is passed along the shortcut path and then added to the main path f(x). Then, we apply the relu activation to f(x) + x to produce the output signal = relu( f(x) + x ). The code implementation of the skip connection is straightforward. Look at the code snippet below:
from keras.layers import Conv2D, Activation, Add

# You first store the value of the shortcut to be equal to the input tensor X
# (F1 here stands for the number of filters in the CONV layers)
X_shortcut = X
  
 # Then perform the main path operations: CONV+ReLU + CONV
 X = Conv2D(filters = F1, kernel_size = (3, 3), strides = (1,1))(X)
 X = Activation('relu')(X)
 X = Conv2D(filters = F1, kernel_size = (3, 3), strides = (1,1))(X)
  
 # Then add the both paths together
 X = Add()([X, X_shortcut])
  
 # and finally, apply the relu activation function
 X = Activation('relu')(X)
This combination of the skip connection and convolutional layers is called the residual block. Similar to the Inception network, ResNet is composed of a series of building blocks that are stacked on top of each other. These building blocks are called residual blocks.
Figure 5.26
From the above diagram, you can observe the following:
  • Feature extractors: to build the feature extractor part of the ResNet, we start with a CONV + POOL layer, then stack residual blocks on top of each other to build the network. When we are designing our ResNet network, we can add as many residual blocks as we want to build even deeper networks.
  • Classifiers: the classification part is still the same as we learned in other networks: fully-connected layers followed by a softmax.
Now that you know what a skip connection is and you are familiar with the high-level architecture of ResNets, let’s unpack the residual blocks to understand how they work.

5.5.2   Residual blocks

The residual module consists of two branches:
  1. Shortcut path: which connects the input to an addition with the second branch
  2. Main path: a series of convolutions and activations. The main path consists of 3 convolutional layers with relu activations. We will also add batch normalization to each CONV layer to reduce overfitting and speed up training. The main path architecture looks like this: [CONV > BN > ReLU] x 3.
Figure 5.27
Similar to what we explained earlier, the shortcut path is added to the main path right before the activation function of the last CONV layer. Then we apply the 'relu' function after adding the two paths. Notice that there are no pooling layers in the residual block. Instead, the authors of ResNet decided to do dimension downsampling using bottleneck 1x1 convolutional layers, similar to the Inception network. So, each residual block will start with a 1x1 CONV to downsample the input dimension volume + a 3x3 CONV + another 1x1 CONV to downsample the output. This is a good technique to keep control of the volume dimensions across many layers. This configuration of the residual block is called the bottleneck residual block. Now, you are ready to start building your ResNet in code. One thing is left. When you are stacking residual blocks on top of each other, the volume dimensions change from one block to another. And as you might recall from the matrices introduction in chapter 2, to be able to perform matrix addition operations, the matrices must have the same dimensions. To fix this problem, we need to downsample the shortcut path as well before merging both paths. We do that by adding a bottleneck layer (1x1 CONV + BN) to the shortcut path, as you see in the diagram below. This is called the reduce shortcut.
Figure 5.28
Before we jump in to the code implementation, let’s recap what we discussed in residual blocks:
  • Residual blocks contain two paths: the shortcut path and the main path
  • The main path consists of three CONV layers, and we add a batch norm layer to each of them:
  • 1x1 conv
  • 3x3 conv
  • 1x1 conv
  • There are two ways we can implement the shortcut path:
  • Regular shortcut: in which we just add the input dimensions to the main path
  • Reduce shortcut: in which we add a CONV layer in the shortcut path before merging with the main path
When we are implementing the ResNet network, we will use both regular and reduce shortcuts. This will get clearer when you see the full implementation soon. For now, we will implement a bottleneck_residual_block function that takes a boolean argument reduce. When reduce = True, the function uses the reduce shortcut; otherwise, it implements the regular shortcut. The function takes the following arguments:
  • X -- input tensor of shape (number of samples, height, width, channel)
  • kernel_size -- integer, specifying the shape of the middle CONV's window for the main path
  • filters -- python list of integers, defining the number of filters in the CONV layers of the main path
  • reduce -- boolean, True = use the reduce shortcut
  • s -- integer, strides
      
And returns:
  • X -- output of the residual block, tensor of shape (height, width, channel)
from keras.layers import Conv2D, BatchNormalization, Activation, Add

def bottleneck_residual_block(X, kernel_size, filters, reduce=False, s=2):
    # unpack the list to retrieve the number of filters of each CONV layer
    F1, F2, F3 = filters

    # save the input value to add it back to the main path later
    X_shortcut = X

    if reduce:
        # if we are to reduce the spatial size, apply a 1x1 CONV layer to the shortcut path
        # to do that, we need both CONV layers to have similar strides
        X_shortcut = Conv2D(filters=F3, kernel_size=(1, 1), strides=(s, s))(X_shortcut)
        X_shortcut = BatchNormalization(axis=3)(X_shortcut)

        # if reduce, set the strides of the first CONV to match the shortcut strides
        X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(s, s), padding='valid')(X)
        X = BatchNormalization(axis=3)(X)
        X = Activation('relu')(X)
    else:
        # first component of the main path
        X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(1, 1), padding='valid')(X)
        X = BatchNormalization(axis=3)(X)
        X = Activation('relu')(X)

    # second component of the main path
    X = Conv2D(filters=F2, kernel_size=kernel_size, strides=(1, 1), padding='same')(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)

    # third component of the main path (no activation yet)
    X = Conv2D(filters=F3, kernel_size=(1, 1), strides=(1, 1), padding='valid')(X)
    X = BatchNormalization(axis=3)(X)

    # final step: add the shortcut to the main path and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X
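As a quick sanity check (this small test is an addition of mine, not part of the chapter's implementation), you can call the function on a dummy input tensor and verify the output shapes of both shortcut types:

from keras.layers import Input
from keras.models import Model

# a dummy 56x56x256 input volume
X_input = Input(shape=(56, 56, 256))

# regular shortcut: the spatial size is preserved (56x56x256)
# note: with a regular shortcut, the input depth must already equal F3
X_regular = bottleneck_residual_block(X_input, 3, [64, 64, 256])

# reduce shortcut with stride 2: the spatial size is halved (28x28x512)
X_reduced = bottleneck_residual_block(X_input, 3, [128, 128, 512], reduce=True, s=2)

print(Model(X_input, X_regular).output_shape)   # (None, 56, 56, 256)
print(Model(X_input, X_reduced).output_shape)   # (None, 28, 28, 512)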

5.5.3   ResNet implementation in Keras

Alright, we’ve learned a lot about residual blocks so far. Let’s add these blocks on top of each other to build the full ResNet architecture. In this chapter, we will implement ResNet50. It is a version of the ResNet architecture that contains 50 weight layers, hence the name ResNet50. You can use the same approach to develop ResNet with 18, 34, 101, and 152 layers by following the architecture in the table below from the Deep Residual Learning for Image Recognition paper.
Figure 5.29
We know from the previous section that each residual module contains 3 CONV layers, so we can now compute the total number of weight layers inside the ResNet50 network as follows (a short sketch after this list double-checks the arithmetic for the other ResNet versions):
  • Stage 1: 7x7 CONV layer
  • Stage 2: 3 residual blocks, each containing [1x1 CONV + 3x3 CONV + 1x1 CONV] = total of 9 CONV layers
  • Stage 3: 4 residual blocks = total of 12 CONV layers
  • Stage 4: 6 residual blocks = total of 18 convolutional layers
  • Stage 5: 3 residual blocks = total of 9 convolutional layers
  • FC softmax layer
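As a quick sanity check, the short sketch below computes the weight-layer totals for all the ResNet variants from the blocks-per-stage numbers in the paper's table (Figure 5.29). The basic blocks used in ResNet-18/34 contain two 3x3 CONV layers each, while the bottleneck blocks we implement here contain three CONV layers:

# blocks per stage (stages 2-5) from the ResNet paper's table,
# together with the number of CONV layers inside each block type
resnet_configs = {
    'ResNet18':  ([2, 2, 2, 2], 2),   # basic blocks: two 3x3 CONV layers each
    'ResNet34':  ([3, 4, 6, 3], 2),
    'ResNet50':  ([3, 4, 6, 3], 3),   # bottleneck blocks: 1x1 + 3x3 + 1x1 CONV
    'ResNet101': ([3, 4, 23, 3], 3),
    'ResNet152': ([3, 8, 36, 3], 3),
}

for name, (blocks, convs_per_block) in resnet_configs.items():
    # stage 1 (the 7x7 CONV) + all residual stages + the final FC layer
    total = 1 + sum(blocks) * convs_per_block + 1
    print(name, total)   # prints 18, 34, 50, 101, and 152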
When you sum all these layers together, you get a total of 50 weight layers, which describes the architecture of ResNet50; the sketch above does the same arithmetic for the other ResNet versions. Now let's follow the 50-layer architecture in the table above to build the ResNet50 network. We will build a ResNet50 function that takes input_shape and classes as arguments and outputs the model.
from keras.layers import Input, Conv2D, BatchNormalization, Activation
from keras.layers import MaxPooling2D, AveragePooling2D, Flatten, Dense
from keras.models import Model

def ResNet50(input_shape, classes):
    # define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Stage 1
    X = Conv2D(64, (7, 7), strides=(2, 2), name='conv1')(X_input)
    X = BatchNormalization(axis=3, name='bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = bottleneck_residual_block(X, 3, [64, 64, 256], reduce=True, s=1)
    X = bottleneck_residual_block(X, 3, [64, 64, 256])
    X = bottleneck_residual_block(X, 3, [64, 64, 256])

    # Stage 3
    X = bottleneck_residual_block(X, 3, [128, 128, 512], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])

    # Stage 4
    X = bottleneck_residual_block(X, 3, [256, 256, 1024], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])

    # Stage 5
    X = bottleneck_residual_block(X, 3, [512, 512, 2048], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])

    # AVGPOOL
    X = AveragePooling2D((1, 1))(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes))(X)

    # create the model
    model = Model(inputs=X_input, outputs=X, name='ResNet50')

    return model
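To verify that everything is wired correctly, you can build the model for, say, 224x224 RGB inputs and 1,000 classes (the ImageNet setup) and inspect its summary:

# build ResNet50 for 224x224 RGB images and 1,000 output classes
model = ResNet50(input_shape=(224, 224, 3), classes=1000)

# print a layer-by-layer summary with output shapes and parameter counts
model.summary()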

5.5.4   Learning hyperparameters

The authors followed a training procedure similar to AlexNet's. Namely, training is carried out using mini-batch stochastic gradient descent (SGD) with momentum = 0.9. They set the learning rate to start at 0.1 and then decreased it by a factor of 10 when the validation error stopped improving. They also used L2 regularization with a weight decay of 0.0001, which is not implemented in this chapter for simplicity. As you saw in the implementation above, they used batch normalization (BN) right after each convolution and before the activation to speed up training.
import numpy as np
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD

# set the training parameters
epochs = 200
batch_size = 256

# factor: factor by which the learning rate will be reduced when the validation loss plateaus
# min_lr: lower bound on the learning rate
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1), patience=5, min_lr=0.5e-6)

# compile the model with the SGD optimizer (learning rate 0.1, momentum 0.9)
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.1, momentum=0.9),
              metrics=['accuracy'])

# train the model, passing reduce_lr as a callback so the learning rate is decreased
# when the validation loss stops improving
# X_train, Y_train, X_test, Y_test are assumed to be the prepared training and validation data
model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_test, Y_test),
          epochs=epochs, callbacks=[reduce_lr])
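If you do want to add the L2 weight decay of 0.0001 mentioned above, one way to do it (not part of the chapter's implementation) is to pass a kernel_regularizer to each Conv2D and Dense layer, for example:

from keras.regularizers import l2

# example: a CONV layer with L2 weight decay of 0.0001 applied to its kernel
X = Conv2D(filters=64, kernel_size=(1, 1), strides=(1, 1),
           kernel_regularizer=l2(0.0001))(X)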

5.5.5   ResNet performance on the ImageNet dataset

Similar to the other networks explained in this chapter, the performance of ResNet models is benchmarked on the ImageNet challenge (ILSVRC). ResNet-152 won first place in the 2015 classification competition with a top-5 error rate of 4.49% for a single model, lowered to 3.57% using an ensemble of models. This is much better than the other networks, such as GoogLeNet (Inception), which achieved a top-5 error rate of 6.67%. ResNet also won first place in many object detection and image localization challenges, as we will see in the Object Detection chapter of this book. More importantly, the residual block concept in ResNet opened the door to a whole new set of possibilities for efficiently training super deep neural networks with hundreds of layers.

5.6   Summary and takeaways

  • In this chapter, we've explained the network architectures of five popular CNN networks. The classical CNN architectures: LeNet, AlexNet, and VGGNet, and the advanced architectures: ResNet and Inception (GoogLeNet).
  • Classical CNN architectures share the same classical structure of stacking convolutional and pooling layers on top of each other, with different configurations for their layers.
  • LeNet: consists of 5 weight layers; 3 convolutional and 2 fully-connected layers, with a pooling layer after the first and second convolutional layers.
  • AlexNet: it is deeper than LeNet, containing 8 weight layers; 5 convolutional and 3 fully-connected layers. AlexNet used larger filter sizes of 11x11, 5x5, and 3x3, and pooling layers after the first, second, and fifth convolutional layers.
  • VGGNet: it solved the problem of setting up the hyperparameters of the conv and pooling layers by creating one uniform configuration for them to be used across the entire network. All convolutional layers in the VGG network have a 3x3 filter size with stride = 1 and padding = 'same'. Also, all pooling layers have a pool size of 2x2 and a stride = 2. The idea being that stacking 3x3 CONV layers with greater depth performs better than using fewer, larger filters like 5x5 and 11x11.
  • We also discussed the different VGGNet configurations, like VGG-16 that contains 16 weight layers and VGG-19 that is 19 layers deep.
  • Advanced CNN architectures: We then went beyond the classic networks and discussed two more advanced, even more powerful neural network architectures: Inception and ResNet.
  • Inception: tried to solve the same problem that VGGNet is solving. It was trying to save all the trial-and-error work to configure the convolutional and pooling layers. Instead of having to decide which filter size to use and where to add the pooling layer, the Inception network says: "Let's use them all". The authors created an inception module that contains three convolutional layers with filter sizes of 1x1, 3x3, and 5x5, in addition to a 3x3 pooling layer. The outputs are then concatenated, creating a single volume that will be fed to the next inception module.
  • ResNet: it followed the same approach as Inception and created residual blocks that, when stacked on top of each other, form the network architecture. ResNet attempted to solve the vanishing gradient problem that made learning plateau or degrade when training very deep neural networks. The authors introduced the "skip connections" idea, which allows information to flow from earlier layers in the network to later layers, creating an alternate shortcut path for the gradient to flow through. The fundamental breakthrough with ResNet was that it allowed us to train extremely deep neural networks with hundreds of layers. Prior to ResNet, training very deep neural networks was difficult due to the problem of vanishing gradients.
Below is a summary of the classical and advanced networks:
Table 5.2

Year | CNN                   | Number of layers | Top-5 error rate | Number of parameters
1998 | LeNet                 | 5 layers         | NA               | 60 thousand
2012 | AlexNet               | 8 layers         | 15.3%            | 60 million
2014 | VGGNet                | 16 layers        | 7.3%             | 138 million
2014 | Inception (GoogLeNet) | 22 layers        | 6.67%            | 12 million
2015 | ResNet-152            | 152 layers       | 4.49%            |

