2 Deep learning and neural networks

“If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.”


-- George Edgin Pugh

In the last chapter we discussed the components of the computer vision pipeline: 1) input image, 2) preprocessing, 3) extracting features, and 4) learning algorithm (classifier). We also discussed that in traditional ML algorithms we manually extract features, which produces a vector of features to be classified by the learning algorithm, whereas in deep learning the neural network acts as both the feature extractor and the classifier: it automatically recognizes patterns, extracts features from the image, and classifies them into labels.
Figure 2.1
In this chapter, we will take a short pause from the computer vision context to open the “deep learning algorithm” box from the figure above. We will dive deeper into how neural networks learn features and make predictions. Then, in the next chapter, we will come back to computer vision applications with one of the most popular deep learning architectures, convolutional neural networks (CNNs). The high-level layout of this chapter is as follows:
  • We begin this chapter with the most basic component of the neural network, the perceptron, which is a neural network that contains only one neuron.
  • Then, we will move on to a more complex neural network architecture that contains hundreds of neurons to solve more complex problems. This network is called the Multi-Layer Perceptron (MLP), where neurons are stacked in layers called hidden layers. Here, you will learn the main components of the neural network architecture: 1) input layer, 2) hidden layers, 3) weight connections, and 4) output layer.
  • You will learn that the network training process consists of 3 main steps:
    1. Feedforward operation
    2. Calculate the error
    3. Error optimization: use backpropagation and gradient descent to select the most optimum parameters that minimize the error function
  • We will dive deep into each of these steps. You will see that building a neural network requires making necessary design decisions: choosing an optimizer, the cost function, and activation functions, as well as designing the architecture of the network, including how many layers should be connected to each other and how many neurons should be in each layer.
Ready? Let’s get started!
 
Let’s take a look at the artificial neural networks (ANNs) diagram from chapter 1. You can see that ANNs consist of many neurons, structured in layers, that perform “some kind of calculations” and predict an output. This architecture is also called a Multi-Layer Perceptron (MLP), which is the more intuitive name because it explains that the network consists of perceptrons structured in multiple layers. Both the MLP and ANN notations are used interchangeably to describe this neural network architecture.
Figure 2.2
In the MLP diagram above, each node is called a neuron. We will explain how MLP networks work soon, but first let’s zoom in to its most basic component, the perceptron. Once we understand how a single perceptron works, it will become more intuitive to understand how multiple perceptrons work together to learn the data features.

2.1.1   What is a perceptron?

The simplest neural network is the “perceptron”, which consists of a single neuron. Conceptually, the perceptron functions in a manner similar to a biological neuron. A biological neuron receives electrical signals from its dendrites, modulates them in various amounts, then fires an output signal through its synapses only when the total strength of the input signals exceeds a certain threshold. The output is then fed to another neuron, and so forth.
Figure 2.3
To model the biological neuron phenomenon, the artificial neuron performs two consecutive functions: 1) it calculates the weighted sum of the inputs to represent the total strength of the input signals, and 2) it applies a step function to the result to determine whether to fire an output of 1 if the signal exceeds a certain threshold, or 0 if it doesn’t. As we discussed in chapter 1, not all input features are equally useful or important. To represent that, each input node is assigned a weight value, called its connection weight, to reflect its importance.
Figure 2.4
In the perceptron diagram above, you can see the following:
  1. Input vector: the features vector that is fed to the neuron. It is usually denoted with an uppercase X to represent a vector of inputs (x1, x2, and x3).
  2. Weights vector: each input xi is assigned a weight value wi that represents its importance.
  3. Neuron functions: the calculations performed within the neuron to modulate the input signals: the weighted sum and the step activation function.
  4. Output: the output is controlled by the type of activation function you choose for your network. There are different activation functions that we will discuss in detail in this chapter. For a step function, the output is either 0 or 1. Other activation functions produce probability outputs or float numbers. The output node represents the perceptron prediction.
Let’s take a deeper look at the calculations that happen inside the neuron: 1) weighted sum and 2) step function.

1) Weighted sum function:

Also known as a linear combination, the weighted sum is the sum of all inputs multiplied by their weights, added to a bias term. It produces a straight line, represented by the following equation:

z = ∑ xi · wi + b     (b is the bias)

z = x1·w1 + x2·w2 + x3·w3 + ... + xn·wn + b

Here is how we implement the weighted sum in python:
# import numpy
import numpy as np

# X is the input vector (denoted with an uppercase X)
# w is the weights vector, b is the bias (y-intercept)
z = np.dot(w.T, X) + b

2) Step activation function:

In both artificial and biological neural networks, a neuron does not just output the bare input it receives. Instead, there is one more step, called an activation function: the decision-making unit of the neuron. The activation function takes the same weighted sum from before, z = ∑ xi · wi + b, and activates (fires) the neuron if the weighted sum is higher than a certain threshold. Later in this chapter we’ll review the different types of activation functions and their general purpose in the broader context of neural networks. The simplest activation function, used by the perceptron algorithm, is the “step function”, which produces a binary output (0 or 1). It basically says that if the summed input ≥ 0, it “fires” (output = 1); else (summed input < 0) it doesn’t fire (output = 0).
Figure 2.6

ŷ = g(z), where g is the activation function and z is the weighted sum: z = ∑ xi · wi + b

This is how the step function looks in python:

# z is the weighted sum: z = ∑ xi · wi + b
def step_function(z):
    if z >= 0:
        return 1
    else:
        return 0

2.1.2   How does the perceptron learn?

The neuron uses trial and error to learn from its mistakes. It uses the weights as knobs, tuning their values up and down until the network is trained. The perceptron’s learning logic goes like this:
  1. The neuron calculates the weighted sum and applies the activation function to make a prediction ŷ. This is called the feedforward process.

ŷ = activation (∑xi . wi + b)

  2. It then compares the prediction with the correct label to calculate the error:

error = y − ŷ

  3. Update the weights: if the prediction is too high, the neuron adjusts the weights to make a lower prediction the next time, and vice versa.
  4. Repeat!
This process is repeated many times, and the neuron continues to update the weights to improve its predictions, until step 2 produces a very small error close to zero, which means that the neuron’s prediction is very close to the correct value. At this point, we can stop the training and save the weight values that yielded the best results, to apply them to future cases where the outcome is unknown and make real predictions.
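To make these steps concrete, here is a minimal sketch of that loop in python. The toy data, learning rate, and the update rule shown (the classic perceptron rule) are illustrative choices, not code from the book:

import numpy as np

# toy training data: 2 features per sample, binary labels (made-up values)
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.1, 0.4], [1.2, 1.3]])
y = np.array([1, 1, 0, 1])

w = np.random.randn(2)   # random initial weights
b = 0.0                  # bias
lr = 0.1                 # learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        z = np.dot(w, xi) + b          # 1) weighted sum
        y_hat = 1 if z >= 0 else 0     #    step activation -> prediction
        error = yi - y_hat             # 2) compare with the label
        w += lr * error * xi           # 3) nudge the weights up or down
        b += lr * error                #    and the bias, then repeat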

2.1.3   Is one neuron enough to solve complex problems?

Well, the short answer is no. Let’s see why. The perceptron is a linear function, which means that the trained neuron will produce a straight line that separates our data. Suppose we want to train a perceptron to predict whether a player will be accepted into the college squad. We collect all the data from previous years and train the perceptron to predict whether players will be accepted based on only two features (height and age). The trained perceptron will find the best weight and bias values to produce the straight line that best separates the accepted from the non-accepted (best fit). This line has the equation: z = height·w1 + age·w2 + b. After training is complete on the training data, we can start using this perceptron to make predictions for new players. When we get a player who is 150 cm tall and 12 years old, we compute the above equation with the values (150, 12). When the result is plotted on the graph, if it falls below the line, the neuron predicts that this player will not be accepted; if it falls above the line, the player will be accepted.
Figure 2.7
In the above example, the single perceptron works fine because our data was linearly separable, which means that the training data can be separated by a straight line. But life isn’t always that simple. What happens when we have a more complex dataset that cannot be separated by a straight line (a nonlinear dataset)? As you can see in the figure below, a single straight line will not separate our training data; we say that it does not fit our data. We need a more complex network for more complex data like this. What if we built a network with two perceptrons? That would produce two lines. Would that help us separate the data better?
Figure 2.8
Okay, this is definitely better than the straight line. But I still see some blue and red points mispredicted. Can we add more neurons to make the function fit better? Now you are getting it. Conceptually, the more neurons we add, the better the network will fit our training data. In fact, if we add too many neurons, the network will overfit the training data (not good), but we will talk about this later. The general rule here is: the more complex our network is, the better it learns the features of our data.
We saw that a single perceptron works great with simple datasets that can be separated by a straight line. But, as you can imagine, the real world is much more complex than that. This is where neural networks can show their full potential. To split a nonlinear dataset, we will need more than one line. This means we need to come up with an architecture that uses tens or hundreds of neurons in our neural network. Let’s look at the example below. We learned that the perceptron is a linear function that produces a straight line. So, in order to fit the data below, we try to create a triangle-ish shape that splits the dark dots. It looks like three lines would do the job.
Figure 2.10
The above diagram is an example of a small neural network used to model nonlinear data. In this network, we used three neurons stacked together in one layer, called a hidden layer. It is given this name because we don’t see the output of this layer during the training process.

2.2.1   Multi-Layer Perceptron Architecture

We’ve seen how a neural network can be designed to have more than one neuron. Let’s expand on this idea with a more complex dataset. The diagram below is from the TensorFlow Playground website. We try to model spiral data to classify between two classes. In order to fit this dataset, we need to build a neural network that contains tens of neurons. A very common neural network architecture is to stack the neurons in layers on top of each other, called hidden layers, where each layer has n number of neurons. Layers are connected to each other by weight connections. This leads to the Multi-Layer Perceptron (MLP) architecture below.
Figure 2.11
The main components of the neural network architecture are:
  1. Input layer: contains the features vector.
  2. Hidden layers: the neurons are stacked on top of each other in layers called hidden layers. The reason they are called “hidden” layers is that we don’t see or control the input going into these layers nor their output. All we do is feed the features vector to the input layer and see the output coming out of the last layer (the output layer).
  3. Weight connections (edges): weights are assigned to each connection between the nodes to reflect the importance of a node (feature) on the final output prediction. In graph network terms, these are called edges connecting the nodes.
  4. Output layer: we get the answer or prediction from our model from the output layer. Depending on the setup of the neural network, the final output may be a real-valued output (regression problem) or a set of probabilities (classification problem). This is controlled by the type of activation function we use in the neurons of the output layer. We’ll discuss the different types of activation functions in the next section.
We already discussed the input layer, weights, and the output layer. The new thing in this architecture is the hidden layers.

2.2.2   What are the Hidden Layers?

This is where the core of the feature-learning process takes place. When you look at the hidden layer nodes in the diagram above, you will see that the early layers detect simple patterns in the training data to learn low-level features (straight lines). Later layers detect patterns within patterns to learn more complex features and shapes, then patterns within patterns within patterns, and so on. This concept will come in handy when we discuss convolutional neural networks (CNNs) in later chapters. For now, know that in neural networks we stack hidden layers to learn complex features from each other until we fit our data. So, when designing your neural network, if your network is not fitting the data, the answer could be to add more hidden layers.

2.2.3   How many layers and how many nodes in each layer?

      Hyperparameter Alert!

As a machine learning engineer, most of your work will be designing your network and tuning its hyperparameters. While there is no single prescribed recipe that fits all models, we will try throughout this book to build an intuition about the different hyperparameters and recommend some starting points. Setting the number of layers and the number of neurons in each layer is one of the important hyperparameter decisions you will make when working with neural networks. The network can have one or more hidden layers (technically, as many as you want), and each layer has one or more neurons (also, technically, as many as you want). Your main job, as a machine learning engineer, is to design these layers.

Usually, when we have two or more hidden layers, we call this a deep neural network. The general rule is: the deeper your network is, the more it will fit the training data. This is not always a good thing, because the network can fit the training data so closely that it fails to generalize when you show it new data (overfitting), and it also gets more computationally expensive. So your job is to build a network that is not too simple (one neuron) and not too complex for your data.

It is recommended that you read about the different neural network architectures that have been successfully implemented by others, to build an intuition about what is too simple for your problem. Then start from some point, maybe 2 or 3 layers, and observe the network’s performance. If it is performing poorly (underfitting), add more layers. If you see signs of overfitting (discussed later), decrease the number of layers. More on that later. To build an intuition about how neural networks perform when you add more layers, I advise you to play around with the TensorFlow Playground.

  ·   Weights_0_1: 4 nodes in the input layer * 5 nodes in layer 1 + 1 for the bias = 21 edges

    · Weights_1_2: 5 nodes in layer 1 * 5 nodes in layer 2 + 1 for the bias = 26 edges

    · Weights_2_output: 5 nodes in layer 2 * 3 nodes in the output layer + 1 bias = 16 edges

  ·   Total weights in this network = 63

Then, we have a total of 63 weights in this very simple network. The values of these weights are randomly initialized, then the network performs feedforward and backpropagation to learn the best weight values that most fit our model to the training data. You will see that very soon.
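If you like, you can verify this tally with a couple of lines of python. The layer sizes [4, 5, 5, 3] come from the example above, and the “+ 1” follows the per-layer bias accounting used in that tally:

# input, hidden 1, hidden 2, output
layers = [4, 5, 5, 3]

# nodes_in * nodes_out + 1 bias edge per layer, as counted above
edges = sum(a * b + 1 for a, b in zip(layers, layers[1:]))
print(edges)   # 21 + 26 + 16 = 63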

 

2.2.4   MLP Takeaways

This could feel like a soup of ideas. Let’s recap what we discussed so far:
  • First we talked about the analogy between biological and artificial neurons: both have inputs, plus a neuron that does some calculations to modulate the input signals, plus an output.
  • Then we zoomed in on the artificial neuron’s calculations to explore its two main functions: 1) weighted sum and 2) activation function.
  • Weights: we learned that the network assigns random weights to all the edges. These weight parameters reflect the usefulness (or importance) of the features on the output prediction.
  • Lastly, we saw that perceptrons contain a single neuron. They are linear functions that produce a straight line to split linear data. In order to split more complex (nonlinear) data, we need to apply more than one neuron in our network to form a Multi-Layer Perceptron (MLP).
  • The MLP architecture contains: 1) input features, 2) connection weights, 3) hidden layers, and 4) output layer.
We discussed the high level process of how the perceptron learns. The learning process is a repetition of three main steps: 1) feedforward calculations to produce a prediction (weighted sum and activation), 2) calculate the error, and 3) backpropagate the error and update the weights to minimize the error. Next, we will dive deeper into each one of these steps:
  1. What is an activation function, and what are the different types of activations?
  2. Explain the feedforward process
  3. Calculating the error and the different types of error functions
  4. Optimization algorithms and backpropagation
When you are building your neural network, one of the design decisions you will need to make is which activation function to use for your neurons’ calculations. Activation functions are also referred to as transfer functions, or nonlinearities, because they transform the linear combination of the weighted sum into a nonlinear model. An activation function is placed at the end of each perceptron to decide whether to activate this neuron or not.

Why use activation functions at all? Why not just calculate the weighted sum of our network and propagate that through the hidden layers to produce an output?

The purpose of the activation function is to introduce non-linearity into the network. Without it, a multi-layer perceptron will perform similarly to a single perceptron, no matter how many layers we add. Activation functions are also needed to restrict the output value to a certain finite range. Let’s revisit the example of predicting whether a player gets accepted or not:
Figure 2.13
First, the model will calculate the weighted sum and produce the linear function: z = height·w1 + age·w2 + b. The output of this function has no bound; (z) could literally be any number. We use an activation function to map the prediction values to a finite range. In this example, we used the step function: if z > 0, the point is above the line (accepted), and if z < 0, it is below the line (rejected). So, without the activation function, we just have a linear function that produces a number, but no decision is made in this perceptron; the activation function is the decider of whether to fire this perceptron or not. There is an infinite number of possible activation functions. In fact, the last few years have seen a lot of progress in “state-of-the-art” activations. However, a relatively small list of activations still accounts for the vast majority of activation needs. Let’s dive deeper into some of the most common types of activation functions:

activation(z) = z = w·x + b

Also called the identity function, the linear transfer function passes the signal through unchanged. In practical terms, the output is equal to the input, which means we don’t actually have an activation function. So, no matter how many layers our neural network has, all it is doing is computing a linear activation function, or at most scaling the weighted average coming in, but it doesn’t transform the input into a nonlinear function.


The composition of two linear functions is itself a linear function, so unless you throw a nonlinear activation function into your neural network, you are not computing any interesting functions, no matter how deep you make your network. No learning here!
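To see this concretely, here is a small numpy sketch (with made-up weight values) showing that stacking two linear “layers” collapses into a single linear layer:

import numpy as np

# two layers with a linear (identity) activation
W1, b1 = np.array([[2.0, 0.5]]), 1.0       # layer 1: 2 inputs -> 1 output
W2, b2 = np.array([[3.0]]), -2.0           # layer 2: 1 input  -> 1 output

x = np.array([1.0, 4.0])

# forward pass through both linear layers
h = W1 @ x + b1
y = W2 @ h + b2

# the equivalent single linear layer: W = W2·W1, b = W2·b1 + b2
W = W2 @ W1
b = W2 @ np.array([b1]) + b2
assert np.allclose(y, W @ x + b)   # same result: no extra expressive power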

To understand why, let’s calculate the derivative of the activation a(x) = w·x + b, where here w = 4 and b = 0. When we plot this function, it looks like this:

Then the derivative of a(x) = 4x will be a′(x) = 4.

The derivative of a linear function is constant, i.e. it does not depend on the input value x. This means that every time we do a backpropagation, the gradient will be the same. And this is a big problem: we are not really improving the error, since the gradient is pretty much the same. This will become clearer when we explain backpropagation at the end of this chapter.

The step function produces a binary output. It basically says that if the input x ≥ 0, it “fires” (output y = 1); else (input < 0) it doesn’t fire (output y = 0). It is mainly used in binary classification problems like true or false, spam or not spam, and pass or fail.
Figure 2.14
This is one of the most common activation functions. It is commonly used in binary classifiers to predict the probability of a class when you have two classes. The sigmoid squishes all the values to a probability between 0 and 1, which reduces extreme values or outliers in the data without removing them. The sigmoid, or logistic, function converts infinite continuous variables (in the range −∞ to +∞) into simple probabilities between 0 and 1. It is also called the “S”-shape curve because, when plotted on a graph, it produces an S-shaped curve. While the step function is used to produce a discrete answer (pass or fail), the sigmoid is used to produce the probability of passing and the probability of failing.

Figure 2.15

Here is how sigmoid is implemented in python:

# import numpy
import numpy as np

# sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

p = exp(z) = exp(β0 + β1·age) ... equation 2.3.1

This equation ensures that we always get probabilities greater than 0. Now, what about the values that are higher than 1?

Second, we need to do something about the values that are higher than 1. If you think about proportions, any given number divided by a number greater than it will give us a number smaller than 1. Let’s do exactly that to equation 2.3.1 above. We divide equation 2.3.1 by its value plus a small value, 1, or in some cases a very small value we’ll call epsilon (ε):

p = exp(z) / (exp(z) + 1)

If you divide equation 2.3.1 by exp(z), you get:

p = 1 / (1 + exp(−z))

When we plot the probability of this equation, we get the “S” shape of the sigmoid function, where the probability is no longer below 0 or above 1. In fact, as z grows, the probability asymptotically gets closer to 1, and as the weights move down, the function asymptotically gets closer to 0, but never outside the 0 < p < 1 range. This is the plot of the sigmoid function and of logistic regression.

 

The softmax function is a generalization of the sigmoid function. It is used to obtain classification probabilities when we have more than two classes, and it forces the outputs of the neural network to sum to 1 (with each output between 0 and 1). The most common use case in deep learning (especially in computer vision) is to predict a single class out of many options (more than 2). For example, if you want to build a digit classifier to predict which number is in an image, you are trying to find the probability of a number in the image out of 10 classes (the numbers from 0 to 9, since there can be only one digit in the image).
TIP

The softmax function is the go-to function that you will often use at the output layer of a classifier when you are working on a problem where you need to predict a class among more than two classes. Softmax can work fine if you are classifying two classes as well; it will basically work as a sigmoid function. By the end of this section, I’ll tell you my recommendations on when to use each activation function.
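Here is how softmax could be implemented in python. This is a minimal numpy sketch; subtracting the max before exponentiating is a common numerical-stability trick, not something required by the formula:

import numpy as np

# softmax activation function
def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # probabilities that sum to 1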

Tanh is a shifted version of the sigmoid function. Instead of squeezing the signal values between 0 and 1, tanh squishes all values into the range −1 to 1. Tanh almost always works better than the sigmoid function in the hidden layers because it has the effect of centering your data so that its mean is close to zero (rather than 0.5), which makes learning for the next layer a little bit easier.
One of the downsides of both the sigmoid and tanh functions is that if (z) is very large or very small, the gradient (or derivative, or slope) of the function becomes very small (close to zero), which slows down gradient descent. This is where the ReLU activation function (explained next) comes in to solve the problem.
Figure 2.16
The ReLU activation function activates a node only if the input is above zero. If the input is below zero, the output is always zero. When the input is higher than zero, it has a linear relationship with the output variable: f(x) = max(0, x). At the time of writing, ReLU is considered state of the art among activation functions because it works well in many different situations and tends to train better than sigmoid and tanh in the hidden layers.
Figure 2.17

Here is how ReLU is implemented in python:

# relu activation function
def relu(x):
    if x < 0:
        return 0
    else:
        return x
One disadvantage of the ReLU activation is that the derivative is equal to zero when (x) is negative. Leaky ReLU is a ReLU variation that tries to mitigate this issue. Instead of having the function be zero when x < 0, Leaky ReLU introduces a small negative slope (around 0.01) when (x) is negative. It usually works better than the ReLU function, although it’s just not used as much in practice. Take a look at the Leaky ReLU graph in figure 2.x; can you see the leak?

f(x) = max (0.01x, x)

Why 0.01? Some people like to use the slope as another hyperparameter to tune, but that would be overkill, since you already have other, bigger problems to worry about. Feel free to try different values (0.1, 0.01, 0.002) in your model and see how they work.

Here is how Leaky ReLU is implemented in python:

 
# leaky relu activation function with a 0.01 leak
def leaky_relu(x):
    if x < 0:
        return x * 0.01
    else:
        return x

Here is a cheat sheet of the most common activation functions:
Table 2.1

| Activation function | Description | Equation |
| --- | --- | --- |
| Linear Transfer Function (identity function) | The signal passes through unchanged; it remains a linear function. Almost never used. | f(x) = x |
| Heaviside Step Function (binary classifier) | Produces a binary output, 0 or 1. Mainly used in binary classification to give a discrete value. | f(x) = 1 if x ≥ 0, else 0 |
| Sigmoid/Logistic Function | Squishes all the values to a probability between 0 and 1, which reduces extreme values or outliers in the data. Usually used to classify two classes. | f(x) = 1 / (1 + e^(−x)) |
| Hyperbolic Tangent Function (tanh) | Squishes all values to the range −1 to 1. Tanh almost always works better than the sigmoid function in the hidden layers. | f(x) = tanh(x) |
| Rectified Linear Unit (ReLU) | Activates a node only if the input is above zero. Always recommended for hidden layers; better than tanh. | f(x) = max(0, x) |
| Leaky ReLU | Instead of being zero when x < 0, it introduces a small negative slope (around 0.01) when (x) is negative. | f(x) = max(0.01x, x) |

 
 
Now that we understand how to stack perceptrons in layers, connect them with weighted edges, perform the weighted sum function, and apply activation functions, let’s implement the complete forward-pass calculations to produce a prediction output. The process of computing the linear combination and applying the activation function is called feedforward. We briefly saw how the feedforward is calculated several times in the previous sections; let’s take a deeper look at what happens in this process. The term feedforward is used to imply the forward direction in which the information flows, from the input layer, through the hidden layers, all the way to the output layer. This process happens through the implementation of two consecutive functions: 1) the weighted sum and 2) the activation function. In short, the forward pass is the calculation through the layers to make a prediction. Let’s take a look at this simple three-layer neural network and explore each of its components:
Figure 2.19
  • Layers: this network consists of an input layer with 3 input features and three hidden layers with 3, 4, and 1 neurons in each layer.
  • Weights and biases (w, b): the edges between nodes are assigned random weights, denoted Wab(n), where (n) indicates the layer number and (ab) indicates the weighted edge connecting the a-th neuron in layer (n) to the b-th neuron in the previous layer (n−1). For example, W23(2) is the weight that connects the second node in layer 2 to the third node in layer 1 (a22 to a31).
    (Note that you may see different notations for Wab(n) in other deep learning literature, which is fine as long as you follow one convention for your entire network.)
    The biases are treated similarly to the weights because they are randomly initialized and their values are learned during the training process. So we are going to give them the weights notation (w) for convenience.
  • Activation function σ(x): in this example we are using the sigmoid function as the activation function.
  • Node values (a): we will calculate the weighted sum, apply the activation function, and assign this value to the node amn, where n is the layer number and m is the node index in the layer. For example, a23 means node number 2 in layer 3.

2.4.1   Feedforward calculations

We have all we need to start the feedforward calculations:
Then we do the same calculations for layer 2  all the way to the output prediction in layer 3:
And there you have it! You just calculated the feedforward pass of a three-layer neural network. Let’s take a moment to reflect on what we just did. Take a look at how many equations we needed to solve for such a small network. What happens when we have a more complex problem with hundreds of nodes in the input layer and hundreds more in the hidden layers? It is more efficient to use matrices to pass through multiple inputs at once. Doing this allows for big computational speedups, especially when using tools like numpy, where we can implement this in one line of code. Let’s see how the matrix computation looks:
Figure 2.20
All we did here is simply stack the inputs and weights in matrices and multiply them together. The intuitive way to read this equation is from right to left. Start at the far right and follow along:
  1. First we stacked all the inputs together in one vector of shape (row, column), in this case (3, 1).
  2. Multiply the input vector by the weights matrix from layer 1, W(1), then apply the sigmoid function.
  3. Then multiply the result by the weights for layer 2 => σ W(2), and layer 3 => σ W(3).
  4. If you have a fourth layer, you multiply the result from the above step by σ W(4), and so on, until we get the final prediction output ŷ!
Here is a simplified representation of this matrix formula:

ŷ = σ · W(3) · σ · W(2) · σ · W(1) · (x)
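Here is what that chain of matrix operations could look like in numpy. The layer sizes follow the three-layer example above; the randomly initialized weights are placeholders for illustration, and the biases are folded into the weights, as the text does for convenience:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.rand(3, 1)        # input vector, shape (3, 1)

W1 = np.random.randn(3, 3)      # layer 1 weights
W2 = np.random.randn(4, 3)      # layer 2 weights
W3 = np.random.randn(1, 4)      # layer 3 weights

# feedforward: weighted sum, then activation, layer by layer
a1 = sigmoid(W1 @ x)            # shape (3, 1)
a2 = sigmoid(W2 @ a1)           # shape (4, 1)
y_hat = sigmoid(W3 @ a2)        # final prediction, shape (1, 1)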

2.4.2   Feature learning

The nodes in the hidden layers (ai) are the new features that are learned after each layer. For example, if you look at the diagram from the previous page, you see that we have three feature inputs (x1, x2, and x3). After computing the forward pass in the first layer, the network learns patterns, and these features are transformed into three new features with different values. Then, in the next layer, the network learns patterns within the patterns and produces new features, and so forth. The features produced after each layer are not totally understood: we don’t see them, nor do we have much control over them. It is part of the neural network magic; that’s why they are given the name hidden layers. What we do is look at the final output prediction and keep tuning some parameters until we are satisfied with the network’s performance. To reiterate, let’s see this in a small example. Below is a small neural network that estimates the price of a house based on three features: 1) how many bedrooms it has, 2) how big it is, and 3) which neighborhood it is in. You can see that the original input feature values 3, 2000, and 1 were transformed into new feature values after performing the feedforward process in the first layer, then transformed again into a prediction output value (ŷ). When training a neural network, we look at the prediction output, compare it with the true price to calculate the error, and repeat until we get the minimum error.
Figure 2.21
To help visualize the feature-learning process, let’s take another look at the image we showed earlier (below) from the TensorFlow Playground. You can see that the first layer learns basic features like lines and edges. The second layer starts to learn more complex features, like corners. And so on, until the last layers of the network learn even more complex feature shapes, like circles and spiral shapes, that fit the dataset.
Figure 2.22
That is how neural networks learn new features via their hidden layers. First, they recognize patterns in the data. Then, they recognize patterns within patterns, then patterns within patterns within patterns, and so on. The deeper the network is, the more it learns about the training data.

  ·   A scalar is a single number

  ·   A vector is an array of numbers

  ·   A matrix is a 2-D array

  ·   A tensor is an n-dimensional array with n > 2

 

We will follow the conventions used in most mathematical literature:

    · Scalars are written in lowercase and italics. For instance: n

    · Vectors are written in lowercase, italics, and bold type. For instance: x

    · Matrices are written in uppercase, italics, and bold. For instance: X

    · Matrix dimensions are written as follows ⇒ (row × column)

Multiplication:

  ·   Scalar multiplication: simply multiply the scalar number by all the numbers in the matrix. Note that scalar multiplication doesn’t change the matrix dimensions.

    · Matrix multiplication: when multiplying two matrices, such as in the case of (row 1 × column 1) × (row 2 × column 2), column 1 and row 2 must be equal to each other, and the product will have the dimensions (row 1 × column 2).

 

Where a = 3×13 + 4×8 + 2×6 = 83; the same goes for b = 63 and c = 37.

Now that you know the matrix multiplication rules, pull out a piece of paper and work through the dimensions of the matrices in the neural network example above. Here is the matrix equation again for your convenience:

Figure 2.26

 

The last thing I want you to know about matrices is transposition:

With transposition you can convert a row vector to a column vector and vice versa: the shape (m × n) is inverted and becomes (n × m). The superscript (Aᵀ) is used for transposed matrices:

Figure 2.27
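If you want to verify these rules quickly, numpy makes the dimension bookkeeping explicit. The shapes here are arbitrary examples:

import numpy as np

A = np.random.rand(2, 3)    # (row 1 x column 1) = (2, 3)
B = np.random.rand(3, 4)    # (row 2 x column 2) = (3, 4)

C = A @ B                   # column 1 == row 2, so this works
print(C.shape)              # (2, 4) -> (row 1 x column 2)

print((3 * A).shape)        # scalar multiplication keeps the shape: (2, 3)
print(A.T.shape)            # transposition inverts the shape: (3, 2)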

 

Up until this point, we have learned how to implement the forward pass in neural networks to produce a prediction, which consists of the weighted sum plus activation operations. Now, how do we evaluate the prediction that the network just produced? More importantly, how do we know how far this prediction is from the correct answer (the label)? The answer is: measure the error. The selection of the error function is another important aspect of the design of a neural network. Error functions can also be referred to as cost functions or loss functions, and the terms are used interchangeably in the deep learning literature. The error is a measure of “how wrong” the neural network’s prediction is with respect to the expected output (the label); it quantifies how far we are from the correct solution. For example, if we have a high loss, then our model is not doing a good job. The smaller the loss, the better a job the model is doing; the larger the loss, the more our model needs to be trained to increase its accuracy.

Calculating an error turns this into an optimization problem, which is something all machine learning engineers love (mathematicians too). Optimization problems focus on defining an error function and trying to optimize its parameters to get the minimum error (more on optimization in the next section). For now, it is good to know that, in general, when we are working on a problem, if we are able to define the error function, we have a very good shot at solving it by optimizing the error. In optimization problems, our ultimate goal is to find the optimum variables (weights) that minimize the error function as much as we can. If we don’t know how far we are from the target, how would we know what to change in the next iteration? The process of minimizing this error is called error optimization. There are several optimization methods, which we are going to review in the next section. For now, all we need from the error function is to know how far we are from the correct prediction, or “how much we missed”.

Consider this scenario: suppose we have two data points (two input/goal-prediction pairs) that we are trying to get our network to predict correctly. If the first gives an error of 10 and the second gives an error of −10, then our average error would be ZERO! That is misleading, because error = 0 means that our network is producing perfect predictions, when in fact it missed by 10 each time. We don’t want that. Thus, we want the error of each prediction to always be positive, so the errors don’t cancel each other out when we take the average. Think of an archer who misses the target by 1 inch. We are not really concerned with which direction they missed; all we need to know is how far each shot is from the target.

A visualization of the loss of two separate models plotted over time is shown in figure 2.x. You can see that model 1 is doing a better job of minimizing the error, whereas model 2 starts out better but plateaus after epoch 6.

Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. A thorough discussion of loss functions is outside the scope of this book. Instead, we will focus on the two most commonly used loss functions: 1) Mean Squared Error (and its variations) usually used for regression problems, and 2) Cross Entropy used for classification problems.

2.5.4   Mean Square Error (MSE)

MSE is commonly used in regression problems that require the output to be a real value (like a house price). Instead of just comparing the prediction output with the label (ŷi − yi), the error is squared and averaged over the number of data points, as you see in the equation below:
The MSE is a good choice for a few reasons. The square ensures that the error is always positive, and larger errors are penalized more than smaller errors. Also, it makes the math nice, which is always a plus. The notation in this formula is listed in the table below:
Table 2.1

| Notation | Meaning |
| --- | --- |
| E(W, b) | The loss function. Can also be annotated as J(W, b) in other literature. |
| W | Weights matrix. In some literature, the weights are denoted by the theta sign, θ. |
| b | Biases vector. |
| N | Number of training examples. |
| ŷi | Prediction output. Also notated as hw,b(X) in some deep learning literature. |
| yi | The correct output (the label). |
| (ŷi − yi) | Usually called the residual. |

MSE sensitivity to outliers

MSE is quite sensitive to outliers, since it squares the error value. This might not matter for the specific problem you are solving; in fact, the sensitivity to outliers might be beneficial in some cases. For example, if you are predicting a stock price, you would want to take outliers into account, so sensitivity to outliers is a good thing there. In other scenarios, you wouldn’t want to build a model that is skewed by outliers, like predicting a house price in a city, where we are more interested in the median and less in the mean. A variation of MSE called mean absolute error (MAE) was developed for just this purpose. It averages the absolute error over the entire dataset without taking the square of the error.
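Here is how MSE and MAE could be computed in numpy. This is a small illustrative sketch; the prediction and label vectors are hypothetical house prices, not data from the book:

import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)    # squared residuals, averaged

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))   # absolute residuals, averaged

y = np.array([300.0, 250.0, 180.0])        # true house prices
y_hat = np.array([320.0, 240.0, 205.0])    # predictions
print(mse(y_hat, y), mae(y_hat, y))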

2.5.5   Cross Entropy

Cross Entropy is commonly used in classification problems because it quantifies the difference between two probability distributions. For example, suppose for a specific training instance, you are trying to classify a dog image out of three possible classes (dogs, cats, fish). The true distribution for this training instance is therefore:
Probability(cat)     P(dog)      P(fish)
      0.0               1.0             0.0
You can interpret the above “true” distribution to mean that the training instance has 0% probability of being a cat, 100% probability of being a dog, and 0% probability of being a fish. Now, suppose your machine learning algorithm predicts the following probability distribution:
Probability(cat)     P(dog)      P(fish)
      0.2                0.3              0.5
How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. Use this formula:
Here, (y) is the target probability, (p) is the predicted probability, and (m) is the number of classes. The sum is over the three classes: cat, dog, and fish. In this case the loss is 1.2:
E = - (0.0 * log(0.2) + 1.0 * log(0.3) + 0.0 * log(0.5)) = 1.2
So that is how “wrong” or “far away” your prediction is from the true distribution. Let’s do this one more time, to build some intuition about how the loss changes when the network makes better predictions. In the above example, we showed the network an image of a dog, and it predicted that the image is a dog with only 30% probability, which is very far from the target prediction. In later iterations, the network learns some patterns and gets the prediction up to 50%:
Probability(cat)     P(dog)      P(fish)
      0.3                0.5              0.2
Then, we calculate the loss again:
E = - (0.0*log(0.3) + 1.0*log(0.5) + 0.0*log(0.2)) = 0.69
You see how when the network made a better prediction (dog up to 50% from 30%), the loss decreased from 1.2 to 0.69. In the ideal case, when the network predicts that the image is 100% a dog, the cross entropy loss will be zero (feel free to try the math). To calculate the cross entropy error across all the training examples (n), we use this general formula:
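Here is the dog example reproduced in numpy, a minimal sketch of the instance-level cross-entropy formula, averaged over the rows when several examples are passed at once:

import numpy as np

def cross_entropy(y, p):
    # y: true distribution(s), p: predicted distribution(s), one row per example
    y, p = np.atleast_2d(y), np.atleast_2d(p)
    return np.mean(-np.sum(y * np.log(p), axis=1))

y_true = np.array([0.0, 1.0, 0.0])              # true: it is a dog
print(cross_entropy(y_true, [0.2, 0.3, 0.5]))   # ~1.20, the early prediction
print(cross_entropy(y_true, [0.3, 0.5, 0.2]))   # ~0.69, the later, better one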
 

It is important to call out that you will not be doing these calculations by hand. Understanding how things work under the hood gives you a better intuition when you are designing your neural network. In deep learning projects, we usually use libraries like TensorFlow, PyTorch, or Keras, where the error function is usually a parameter choice.

2.5.6   A final note on errors and weights

As we mentioned before, in order for the neural network to learn, it needs to minimize the error function as much as it can (0 is ideal). The lower the error gets, the more accurate the model will be at predicting values. Now, how do we minimize this error? Let’s look at the perceptron example below, with a single input, to understand the relationship between the weight and the error:
Figure 2.29
Suppose the input x = 0.3 and its label (goal prediction) y = 0.8. Then the prediction output (ŷ) of this perceptron is calculated as follows:

ŷ = w · x = w · 0.3

And the error, in its simplest form, is calculated by comparing the prediction ŷ and the label y:

error = | ŷ − y |
      = | (w · x) − y |
      = | w · 0.3 − 0.8 |

If you look at the error function above, you will notice that the input value x and the goal prediction y are fixed; they will never change for this specific data point. The only two variables we can change in this equation are the error and the weight. Now, if we want to get to the minimum error, which variable can we play with? Correct: the weight! The weight acts as a knob that the network adjusts up and down until it gets the minimum error. This is how the network learns: by adjusting weights. When we plot the error function with respect to the weight, we get the following graph:
Figure 2.30
As we mentioned before, we initialize the network with random weights. The initial weight lies somewhere on this curve, and our mission is to make it descend the curve to its optimal value with the minimum error. The process of finding the goal weights of the neural network happens by adjusting the weight values in an iterative process using an optimization algorithm.
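You can see this knob effect numerically by sweeping the weight over a range of values, a toy sketch using x = 0.3 and y = 0.8 from above:

import numpy as np

x, y = 0.3, 0.8
for w in np.linspace(0.0, 5.0, 11):
    error = abs(w * x - y)      # error = | w . 0.3 - 0.8 |
    print(f"w = {w:.1f}  error = {error:.2f}")
# the error bottoms out near w = 0.8 / 0.3 ≈ 2.67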
Training a neural network means showing the network many examples (the training dataset); the network makes predictions through feedforward calculations and compares them with the correct labels to calculate the error. Finally, the neural network needs to adjust the weights (on all edges) until it gets the minimum error value, which means maximum accuracy. All we need now is to build algorithms that can find these optimum weights for us. Ahh, optimization! A topic that is dear to my heart, and to every machine learning engineer (mathematicians too). Optimization is a way of framing a problem in order to maximize or minimize some value. The best thing about computing an error function is that it turns the neural network into an optimization problem, where our goal is to minimize the error.

An optimization example: suppose you want to optimize your commute from home to work. First, you need to define the metric that you are optimizing, the “error function”. Maybe you want to optimize the price of the commute, or the time, or the distance. Then, based on that specific loss function, you work on minimizing its value by changing some parameters. Changing the parameters to minimize (or maximize) a value is optimization. If you choose the loss function to be the price, maybe you will take a longer commute that takes 2 hours, or maybe (hypothetically) walk for 5 hours to minimize the price. On the other hand, if you want to optimize the time spent on the commute, maybe you will spend $50 on a cab that cuts the commute down to 20 minutes. So, based on the loss function you defined, you can start changing your parameters to get the results you want.

In neural networks, optimizing the error means updating the weights and biases until we find the optimal weights, that is, the best values of the weights to produce the minimum error.

Let’s look at the space that we are trying to optimize:

Figure 2.31
In the simplest form, a perceptron with one input, we have only one weight in our network. We can then easily plot the error (that we are trying to minimize) with respect to this weight, represented by this 2D curve:
Figure 2.32
Okay, what if we have 2 weights? If we were able to graph all the possible values of these 2 weights, we would get a 3D plane of the error. More than 2 weights? Your network will most probably have hundreds or thousands of weights (because each edge in your network has its own weight value). Since we humans can only visualize a maximum of 3 dimensions, it is impossible for us to visualize error graphs when we have 10 weights, not to mention hundreds or thousands. So, from this point on, we will study the error function using 2D or 3D plots of the error. In order to optimize the model, our goal is to search this space to find the best weights that achieve the lowest possible error.
Figure 2.33
Why do we need an optimization algorithm? Can’t we just brute-force through a lot of weight values until we get the minimum error? One possible approach (only theoretically) is a brute-force approach, where we simply try a lot of different possible weights (say 1,000 values per weight) and find the one that produces the minimum error. Would that work? Well, theoretically, yes. This approach might work when we have very few inputs and one or two neurons in our network. Let me try to convince you that it wouldn’t scale. Let’s look at a scenario with a very simple neural network: suppose we want to predict a house price based on only four features (inputs) and just one hidden layer of 5 neurons:
Figure 2.34
As you can see, we have 20 edges (weights) from the input to the hidden layer, plus 5 weights from the hidden layer to the output prediction, for a total of 25 weight variables that need to be adjusted for optimum values. To brute-force our way through a simple neural network of this size, trying 1,000 different values for each weight, we would have a total of 10^75 combinations:

1,000 × 1,000 × ... × 1,000 = 1,000^25 = 10^75 combinations

Let’s say we were able to get our hands on the fastest supercomputer in the world, Sunway TaihuLight, which operates at a speed of 93 petaFLOPS (floating-point operations per second) => 93 × 10^15 FLOPS. In the best-case scenario, this supercomputer would need:
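A back-of-the-envelope sketch of that calculation, assuming (very generously) one weight combination evaluated per floating-point operation:

combinations = 1000 ** 25          # 10**75 weight settings to try
flops = 93e15                      # Sunway TaihuLight: 93 petaFLOPS

seconds = combinations / flops     # best case: 1 combination per FLOP
years = seconds / (3600 * 24 * 365)
print(f"{years:.1e} years")        # ~3.4e+50 years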
That is a stretch of time far longer than the universe has existed. Who has that kind of time to wait for the network to train? Remember that this is a very simple neural network, one that usually takes a few minutes to train using smarter optimization algorithms. In the real world, you will be building more complex networks that have thousands of inputs and tens of hidden layers, and you will be required to train them in a matter of hours (or days, and sometimes weeks). So we have to come up with a different approach to find the optimal weights. Hopefully, I have convinced you that brute-forcing through the optimization process is not the answer. Now, let’s study the most popular optimization algorithm for neural networks: gradient descent. Gradient descent has several variations: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MB-GD).

What is a gradient?

The gradient is the function that tells you the slope, or rate of change, of the line tangent to the curve at any given point. It is also known as the derivative. “Gradient” is just a fancy term for the slope or steepness of the curve.
Figure 2.35

What is gradient descent?

Gradient descent simply means updating the weights iteratively to descend the slope of the error curve until we reach the point with minimum error. Let’s take a look at the error function we introduced earlier with respect to the weights. At the initial weight point, we calculate the derivative of the error function to get the slope (direction) of the next step. We keep taking steps down the curve until we reach the minimum error.
Figure 2.36

How does gradient descent work?

To visualize how gradient descent works, let’s plot the error function in a 3D graph and go through the process step by step. The random initial weight (starting weight) is at point A, and our goal is to descend this error mountain to the goal weight values w1 and w2, which produce the minimum value of the error. The way we do that is by taking a series of steps down the curve until we get the minimum error. In order to descend the error mountain, we need to determine two things for each step:
  1. The step direction (gradient)
  2. The step size (learning rate)
Figure 2.37

1) The direction (gradient)

Suppose you are standing on top of the error mountain at point A. To get to the bottom, you need to determine the step direction that will make you descend the most (i.e. that has the steepest slope). And what is the slope again? It is the derivative of the curve. So, standing on top of that mountain, you look at all the directions around you and find the one that will make you descend the most (1, 2, 3, or 4, for example). Say it is direction 3. We then get to point B and restart the process (calculate the feedforward and the error), find the direction of steepest descent, and so forth, until we get to the bottom of the mountain. This process is called gradient descent. By taking the derivative of the error with respect to the weight (dE/dw), we get the direction we should take. Now, one thing is left: the gradient only determines the direction. How big should the step be? It could be a 1-foot step or a 100-foot jump. This is what we need to determine next.

2) The step size (learning rate α)

The learning rate is the size of each step the network takes when it descends the error mountain, and it is usually denoted by the Greek letter alpha (α). It is one of the most important hyperparameters that you will tune when training your neural network (more on that later). A larger learning rate means the network learns faster (since it descends the mountain with larger steps), and smaller steps mean slower training. Well, that sounds simple enough: let’s use a large learning rate and complete the neural network training in minutes instead of waiting for hours. Right? Not quite. Let’s take a look at what can happen when we set a very large learning rate.
Figure 2.38
In figure 2.x, you start at point A. When you take a large step in the arrow direction, instead of descending the error mountain, you end up at point B on the other side, then another large step takes you to C, and so forth. The error will keep oscillating and will never descend. We will talk more about learning rate tuning and how to detect oscillating error later. For now, you need to know this: if you use a very small learning rate, the network will eventually descend the mountain and get to the minimum error, but it will take a longer time to train (maybe weeks or months). On the other hand, if you use a very large learning rate, the network might keep oscillating and never train. So, we usually initialize the learning rate to 0.1 or 0.01, see how the network performs, and then tune it further.

Putting direction and step together

By multiplying the direction (derivative) by the step size (learning rate), we get the change of the weight for each step:
We add the minus sign because the derivative gives the slope in the uphill direction; since we need to descend the mountain, we go in the opposite direction of the slope.
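In code, that update rule could look like the runnable toy below. Note one illustrative liberty: the error here is squared, E = (w·x − y)², rather than the absolute error used earlier, so that the derivative is straightforward to write down:

x, y = 0.3, 0.8                  # input and target from the earlier example
w = 0.1                          # starting weight
alpha = 0.5                      # learning rate (step size)

for step in range(50):
    dE_dw = 2 * x * (w * x - y)  # derivative of E = (w*x - y)**2
    w = w - alpha * dE_dw        # step against the slope
print(w)                         # converges toward 0.8 / 0.3 ≈ 2.67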

If you want to write out the derivative of the sigmoid activation function in code, it will look like this:

import numpy as np

# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

 

Note that you don’t need to memorize the derivative rules, nor will you need to calculate the derivatives of the functions yourself. Thanks to the awesome deep learning community, we have great libraries that will compute these functions for you in just one line of code. It is just valuable to understand how things happen under the hood.

Pitfalls of batch gradient descent

Gradient descent is a very powerful algorithm to get to the minimum error. But, it has two major pitfalls:

1) Not all cost functions look like the simple bowls we saw above. There may be holes, ridges, and all sorts of irregular terrain that make reaching the minimum error very hard. Consider figure 2.x, where the error function is a little more complex, with ups and downs.

Figure 2.42

Remember what we said about weights initialization? The starting point is randomly selected. What if the starting point is as shown in this figure when we start the gradient descent algorithm? The error will start descending the small mountain on the right and will indeed reach a minimum value. But this minimum value is not the lowest possible error value for this error function. This is called the local minima, where this specific point is the minimum value for the local mountain where the weight randomly started. Instead, we want to get to the lowest possible error, called the global minima.

2)         Batch gradient descent uses the entire training set to compute the gradients at every step. Remember the loss function, which sums the error over all N training examples? With mean squared error, for example, it looks like E = (1/N) · Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)².

This means that if your training set (N) has 100,000,000 (100 million) records, the algorithm has to sum over 100 million records just to take one step. That is computationally very expensive and slow to train, which is why this algorithm is also called Batch Gradient Descent: it uses the entire training data in one batch. One possible approach to solving both problems is Stochastic Gradient Descent (SGD), in which the algorithm randomly picks one data point at a time and performs the gradient descent step on it alone. The randomness gives each weight update a slightly different direction, which lets the descent jump out of shallow local minima and, in practice, land very close to the global minimum. That is the concept behind the Stochastic Gradient Descent algorithm.

2.6.3   Stochastic Gradient Descent (SGD)

Stochastic is just a fancy word for random. Stochastic gradient descent (SGD) is probably the most-used optimization algorithm for machine learning in general and for deep learning in particular. While gradient descent measures the loss and gradient over the full training set to take one step towards the minimum, stochastic gradient descent randomly picks one instance in the training set for each step and calculates the gradient based only on that single instance. Now, let's take a look at the pseudocode of both GD and SGD to get a better understanding of the differences between the two algorithms:
Table 2.1

GD
1) Take ALL the data
2) Compute the gradient
3) Update the weights and take a step down
4) Repeat for n number of epochs (iterations)
(figure: top view of the error mountain)

Stochastic GD
1) Randomly shuffle the samples in the training set
2) Pick one data instance
3) Compute the gradient
4) Update the weights and take a step down
5) Pick another data instance
6) Repeat for n number of epochs (training iterations)
(figure: top view of the error mountain)
Because in batch GD we take a step only after computing the gradient over the entire training data, the path down the error is smooth and almost a straight line. Due to the stochastic (random) nature of SGD, by contrast, the path towards the global cost minimum is not direct as in BGD; it may go "zig-zag" if we visualize the cost surface in a 2D space (figure 2.x). That is because every SGD iteration tries to fit just a single training example better, which makes it a lot faster but does not guarantee that every step takes us down the curve. This is fine: SGD will end up very close to the global minimum and, once it gets there, will keep bouncing around without ever settling down, which is good enough for most practical purposes. Generally, SGD almost always performs better and faster than batch GD.
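
To make the table concrete, here is a short Python sketch of the Stochastic GD column. The function name sgd and the helper grad_fn (which is assumed to return dE/dw for a single training instance) are hypothetical, not the book's code:

import numpy as np

def sgd(X, y, w, grad_fn, lr=0.01, epochs=10):
    for _ in range(epochs):                    # repeat for n epochs
        idx = np.random.permutation(len(X))    # 1) randomly shuffle the training set
        for i in idx:                          # 2) pick one data instance
            g = grad_fn(X[i], y[i], w)         # 3) compute the gradient on it alone
            w = w - lr * g                     # 4) update the weights, step down
    return w                                   # 5-6) next instance, next epoch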

2.6.4   Mini-batch Gradient Descent (MB-GD)

Mini-Batch Gradient Descent (MB-GD) is a compromise between BGD and SGD. Instead of computing the gradient from one sample (SGD) or from all training samples (BGD), we divide the training set into mini-batches and compute the gradient from each mini-batch (a common mini-batch size is k = 256). MB-GD converges in fewer iterations than BGD because we update the weights more frequently, and it lets us use vectorized operations, which typically results in a computational performance gain over SGD.
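
The SGD loop sketched above becomes mini-batch gradient descent by taking the inner step over a slice of the shuffled data instead of a single instance. This is a sketch under the same assumption of a hypothetical grad_fn, now computing the gradient over a whole mini-batch:

import numpy as np

def minibatch_gd(X, y, w, grad_fn, lr=0.01, batch_size=256, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)              # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]   # one mini-batch of indices
            w = w - lr * grad_fn(X[batch], y[batch], w)
    return w

# batch_size=1 recovers SGD; batch_size=len(X) recovers batch GD.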

2.6.5   Gradient descent takeaways

There is kind of a lot going on here, so let’s just sum it up for ourselves, shall we? Here is how gradient descent is summarized in my head:
  • Three types: batch, stochastic, and mini-batch gradient descent
  • All follow the same concept:
  • They find the direction of the steepest slope: the derivative of the error with respect to the weight, dE/dwi
  • Set the learning rate (or step size): the algorithm computes the slope, but you set the learning rate as a hyperparameter that you tune by trial and error
  • Start the learning rate at 0.01, then go down: 0.001, 0.0001, 0.00001. The lower you set your learning rate, the more guaranteed the network is to descend to the minimum error (if you train for an infinite time). Since we don't have infinite time, 0.01 is a reasonable start; then go down from there
  • Batch GD: updates the weights after computing the gradient of ALL the training data. This can be computationally very expensive when the data is huge. Doesn't scale well.
  • Stochastic GD: updates the weights after computing the gradient of a single instance of the training data. SGD is faster and usually reaches very close to the global minimum.
  • Mini-batch GD: a compromise between batch and stochastic. Neither all the data nor a single instance; instead, it takes a group of training instances (called a mini-batch), computes the gradient on them, updates the weights, and repeats until it has covered all the training data. In most cases, mini-batch gradient descent is a good starting point for experimenting.
  • batch_size: a hyperparameter that you will tune. This will come up again in the hyperparameter tuning section in chapter 4, but typically you can start experimenting with batch_size = 32, 64, 128, 256.
  • Don't get batch_size confused with epochs. An epoch is one full cycle over all the training data. The batch is how many training samples are in the group we compute the gradient for. For example, if we have 1,000 samples in our training data and set batch_size = 256, then epoch 1 = batch 1 (256 samples) + batch 2 (256 samples) + batch 3 (256 samples) + batch 4 (232 samples), as the quick check below confirms.
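
A quick check of that arithmetic (the numbers are the example's, not a recommendation):

import math

n_samples, batch_size = 1000, 256
batches_per_epoch = math.ceil(n_samples / batch_size)          # 4 batches
last_batch = n_samples - (batches_per_epoch - 1) * batch_size  # 232 samples
print(batches_per_epoch, last_batch)                           # 4 232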
Finally, you need to know that there have been a lot of variations to gradient descent that have been used over the years. And this is a very active research area. Some of the most popular enhancements are:
  • Nesterov accelerated gradient
  • RMSprop
  • Adam
  • Adagrad
But don't worry about these optimizers now. In chapter 4, we will discuss tuning techniques to choose and improve your optimizer's learning in more detail. I know that was a lot, but stay with me. These are the main things I want you to remember from this section:
  • how gradient descent works (slope + step size),
  • the difference between batch, stochastic, and mini-batch, and
  • the GD hyperparameters that you will tune: learning rate and batch_size.
If you have this covered, you are good to move on to the next section. And don't worry too much about hyperparameter tuning; I'll cover network tuning in more detail in the next chapters and in almost all the projects in this book.
 

2.7   Backpropagation

Backpropagation is the core of how neural networks learn. Up until this point, we learned that training a neural network typically happens by the repetition of the following 3 steps:

1)      Feedforward: get the linear combination (weighted sum) and apply the activation function to get the output prediction (ŷ).

ŷ = σ(W(3) · σ(W(2) · σ(W(1) · x)))

2)      Compare the prediction with the label to calculate the error or loss function.

3)      Optimize:
        a) Use the gradient descent optimization algorithm to compute the Δw that minimizes the error function
        b) Backpropagate the Δw through the network to update the weights
In this section, we will dive deeper into step 3-b, backpropagation.

2.7.1   What is backpropagation?

Backpropagation, or the backward pass, means propagating derivatives of the error with respect to each specific weight, dE/dwi, from the last layer (output) back to the first layer (inputs) to adjust the weights. By propagating the delta_weight backwards from the prediction node (y_hat) all the way through the hidden layers back to the input layer, the weights get updated (w_new = w_current + Δw), which takes the error one step down the error mountain. Then the cycle starts again (steps 1 to 3) to update the weights and take the error another step down, until we get to the minimum error. This backward-pass process is called backpropagation. Backpropagation might sound clear when we have only one weight: we simply adjust the weight by adding the Δw (w_new = w_current − α · dE/dwi). But it gets complicated when we have a multi-layer perceptron (MLP) network with many weight variables. To make this clearer, consider this scenario:
Figure 2.45
How do we compute the change of the total error with respect to w13 (dE/dw13)? Remember that dE/dw13 basically asks: how much would the total error change when we change the parameter w13? We learned how to compute dE/dw21 by applying the derivative rules to the error function. That is straightforward because w21 is directly connected to the error function. But to compute the derivatives of the total error with respect to the weights all the way back to the input, we need a calculus rule called the chain rule.

Let's see how backpropagation uses the chain rule to flow the gradients in the backward direction through the network:

Figure 2.45b
Okay, let's apply the chain rule to calculate the derivative of the error with respect to the third weight on the first input, w 1,3 (1), where the (1) means layer 1 and w 1,3 means node number 1 and weight number 3. The equation might look complex at first, but all we are really doing is multiplying the partial derivatives along the edges, starting from the output node all the way backward to the input node. The notation is what makes this look complex; once you understand how to read w 1,3 (1), the backward pass is nothing more than that chain of multiplied partial derivatives.
There you have it. That is the backpropagation technique used by neural networks to update the weights to best fit our problem. Let's take a quick look at how this could be implemented in code.
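
Here is a minimal numeric sketch of that chain of multiplications on a tiny two-layer network with scalar weights. The values, variable names, and network shape are made up for illustration; this is not the book's project code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 0.5, 1.0       # one input and its true label (hypothetical values)
w1, w2 = 0.8, -1.2    # randomly chosen starting weights

# Forward pass: h = sigmoid(w1*x), y_hat = sigmoid(w2*h), E = 0.5*(y - y_hat)**2
h = sigmoid(w1 * x)
y_hat = sigmoid(w2 * h)

# Backward pass: multiply the partial derivatives edge by edge
dE_dyhat = -(y - y_hat)            # dE/dy_hat
dyhat_dz2 = y_hat * (1 - y_hat)    # sigmoid derivative at the output node
dE_dw2 = dE_dyhat * dyhat_dz2 * h  # w2 is directly connected to the error
dE_dw1 = dE_dyhat * dyhat_dz2 * w2 * h * (1 - h) * x  # chain rule back to w1

print(dE_dw2, dE_dw1)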

2.7.2   Backpropagation takeaways

  • Backpropagation is a learning procedure for neurons.
  • Backpropagation repeatedly adjusts the weights of the connections in the network to minimize the cost function (the difference between the actual output vector and the desired output vector).
  • As a result of the weight adjustments, hidden layers come to represent important features other than the features represented in the input layer.
  • For each layer, the goal is to find a set of weights that ensures that, for each input vector, the output vector produced is the same as (or close to) the desired output vector. The difference in values between the produced and desired outputs is called the error function.
  • Backward pass (or backpropagation): starts at the end of the network, feeds the errors backwards, recursively applies the chain rule to compute gradients all the way to the inputs of the network, and then updates the weights. This method of propagating the errors back and computing the gradients is called backpropagation.
  • To reiterate, the goal of a typical neural network problem is to discover a model that fits our data "best". Ultimately, we want to minimize the cost or loss function by choosing the best set of weight parameters.
Figure 2.47

2.8   Chapter summary and takeaways

I know this could be a lot to take in. But here is what I want you to take away from this chapter:
  • Perceptrons work fine for datasets that can be separated by one straight line (linear operation).
  • Nonlinear datasets that cannot be modeled by a straight line need a more complex neural network that contains many neurons. Stacking neurons in layers is called a Multi-layer Perceptron (MLP).
  • The network learns by the repetition of 3 main steps (sketched in code after this list):
    1. Feedforward: contains two main operations: calculate the weighted sum, then apply the activation function.
            ŷ = activation(∑ xi · wi + b)
    2. Calculate error: compare the predicted output from the feedforward process with the true label.
            error = y − ŷ
    3. Optimize weights: backpropagate the error and use gradient descent to update the weights and minimize the error.
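
The three steps fit together in a training loop. Below is a minimal sketch for a single-layer network with a sigmoid activation and mean-squared-error loss; the toy data, learning rate, and epoch count are arbitrary illustrations, not the book's project:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])  # toy inputs
y = np.array([[1.], [1.], [1.], [0.]])                  # toy labels
rng = np.random.default_rng(0)
w, b, lr = rng.normal(size=(2, 1)), 0.0, 0.1

for epoch in range(1000):
    y_hat = sigmoid(X @ w + b)            # 1) feedforward: weighted sum + activation
    error = y - y_hat                     # 2) calculate the error
    delta = error * y_hat * (1 - y_hat)   # 3) backpropagate through the sigmoid...
    w += lr * (X.T @ delta) / len(X)      #    ...and take a gradient descent step
    b += lr * float(delta.mean())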
Figure 2.48
  • Pay attention to the difference between parameters and hyperparameters:
  • Parameters are variables that are updated by the network during the training process, like weights and biases. These are tuned automatically by the model during training.
  • Hyperparameters are variables that you tune, like the number of layers, activation functions, loss functions, optimizers, early stopping, and the learning rate. These are tuned by us before training the model.

Neural network hyperparameters:

  • Number of hidden layers: You can have as many layers as you want, each with as many neurons as you want. The general idea is that the more neurons you have, the better your network will learn the training data. But if you have too many neurons, this might lead to a phenomenon called overfitting: the network learned the training set so well that it memorized it instead of learning its features and thus will fail to generalize. To get the appropriate number of layers, start with a small network and observe the network performance. Then start adding layers until you get satisfying results.
  • Activation functions: There are many types of activation functions you can use. The two most popular ones are ReLU and Softmax. It is recommended that you use ReLU activation in the hidden layers and Softmax for the output layer (you will see how this is implemented in this chapter's project, coming next).
  • Error functions: measure how far the network's prediction is from the true label. Mean Squared Error (MSE) is common for regression problems, and Cross Entropy is common for classification problems.
  • Optimizer: the optimization algorithm is used to find the optimum weight values that minimize the error. There are several optimizer types to choose from. In this chapter, we discussed Batch Gradient Descent (BGD), Stochastic GD, and Mini-batch GD. Other popular optimizers that we didn't discuss here are Adam and RMSprop.
  • Batch size: the mini-batch size is the number of sub-samples given to the network, after which a parameter update happens. Bigger batch sizes learn faster but require more space in memory. A good default for batch size might be 32. Also try 64, 128, 256, and so on.
  • Number of epochs: the number of epochs is the number of times the whole training data is shown to the network while training. Increase the number of epochs until the validation accuracy starts decreasing, even while the training accuracy is increasing (overfitting).
  • Learning rate: one of the optimizer's input parameters that we tune is the learning rate. Theoretically, a learning rate that is too small is guaranteed to reach the minimum error (if you train for an infinite time). A learning rate that is too big speeds up the learning but is not guaranteed to find the minimum error. The default lr value of the optimizer in most deep learning libraries is a reasonable start to get decent results; then go down or up by one order of magnitude from there. The sketch below shows where each of these settings plugs in.
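
To see where these knobs live in practice, here is a sketch using the Keras API. The layer sizes, activations, optimizer, learning rate, batch size, and epochs are all placeholder choices, not recommendations, and X_train and y_train are assumed to exist:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),  # hidden layer 1
    Dense(64, activation='relu'),                     # hidden layer 2
    Dense(3, activation='softmax'),                   # output layer
])
model.compile(optimizer=SGD(learning_rate=0.01),      # optimizer + learning rate
              loss='categorical_crossentropy',        # error function
              metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=32, epochs=20)  # batch size and epochs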
Figure 2.49

There are other hyperparameters that we did not discuss yet, like dropout and regularization. We will discuss hyperparameter tuning in detail in chapter 4, after we cover convolutional neural networks in the next chapter. In chapter 4, we will also cover some techniques to help tune your hyperparameters.

 

In general, the best way to tune hyperparameters is by trial and error. By getting your hands dirty with your own projects, as well as learning from other people's neural network architectures, you start to build an intuition for good starting points for hyperparameters.

 

The second thing is to learn to analyze your network's performance and understand which hyperparameter you need to tune for each symptom. And this is what we are going to do in this book. By understanding the intuition behind these hyperparameters and observing the network performance in the projects at the end of the chapters, we will build a clear intuition about which hyperparameter to tune to achieve a given effect. For example, if we see that our error value is not decreasing and keeps oscillating, we might fix that by reducing the learning rate. Or if we see that the network is performing poorly in learning the training data, this might mean that the network is underfitting and we need to build a more complex model by adding more neurons and hidden layers.
