In the last chapter we discussed the components of the computer vision pipeline: 1) input image, 2) preprocessing, 3) feature extraction, and 4) learning algorithm (classifier). We also discussed that in traditional ML algorithms we manually extract features, producing a vector of features to be classified by the learning algorithm, whereas in deep learning the neural network acts as both the feature extractor and the classifier. It automatically recognizes patterns, extracts features from the image, and classifies them into labels.
Figure 2.1
In this chapter, we will take a short pause from the computer vision context to open the “deep learning algorithm” box from the figure above. We will dive deeper into how neural networks learn features and make predictions. Then, in the next chapter, we will come back to computer vision applications with one of the most popular deep learning architectures, Convolutional Neural Networks (CNNs). The high-level layout of this chapter will be as follows:
We begin this chapter with the most basic component of the neural network, the perceptron: a neural network that contains only one neuron.
Let’s take a look at the artificial neural networks (ANNs) diagram from chapter 1. You can see that ANNs consist of many neurons structured in layers to perform “some kind of calculations” and predict an output. This architecture is also called a Multi-Layer Perceptron (MLP), which is somewhat more intuitive because the name itself explains that the network consists of perceptrons structured in multiple layers. Both the MLP and ANN notations are used interchangeably to describe this neural network architecture.
Figure 2.2
In the MLP diagram above, each node is called a neuron. We will explain how MLP networks work soon, but first let’s zoom in to its most basic component, the perceptron. Once we understand how a single perceptron works, it will become more intuitive to understand how multiple perceptrons work together to learn the data features.
2.1.1 What is a perceptron?
The most simple neural network is the “perceptron”, which consists of a single neuron. Conceptually, the perceptron functions in a similar manner to a biological neuron. A biological neuron receives electrical signals from its dendrites, modulates these signals in various amounts, then fires an output signal through its synapses only when the total strength of the input signals exceeds a certain threshold. The output is then fed to another neuron, and so forth.
Figure 2.3
To model the biological neuron phenomenon, the artificial neuron performs two consecutive functions: 1) it calculates the weighted sum of the inputs to represent the total strength of the input signals, and 2) it applies a step function to the result to determine whether to fire the output: 1 if the signal exceeds a certain threshold, 0 otherwise. As we discussed in chapter 1, not all input features are equally useful or important. To represent that, each input node is assigned a weight value to reflect its importance. These weights are called connection weights.
Let’s take a deeper look at the calculations that happen inside the neuron: 1) weighted sum and 2) step function.
1) Weighted sum function:
Also known as a linear combination, it is the sum of all inputs multiplied by their weights, added to a bias term. This function produces a straight line, represented by the following equation:

z = ∑ xi·wi + b

z = x1·w1 + x2·w2 + x3·w3 + ... + xn·wn + b
Here is how we implement the weighted sum in python:

import numpy as np

# X is the input vector, w is the weights vector, b is the bias (y-intercept)
z = np.dot(w.T, X) + b
What is the bias?
Let’s brush up our memory on linear algebra. The function of a straight line is represented by this equation: y = mx + b, where b is the y-intercept. To be able to define a linear line, you need two things: 1) the slope of the line, and 2) a point on that line. The bias is that point on the y-axis. It allows you to move the line up and down on the y-axis to better fit the prediction with the data. Without the bias (b), the line always has to go through the origin point (0,0) and you will get a poorer fit. To visualize the importance of the bias, look at the graph in figure 2.x and try to separate the circles from the star using a line that passes through the origin (0,0). It is not possible.
The input layer can be given biases by introducing an extra input that always has a value of 1, as you can see in the figure below. In neural networks, the value of the bias (b) is treated as an extra weight and is learned and adjusted by the neuron to minimize the cost function.
2) Step activation function:
In both artificial and biological neural networks, a neuron does not just output the bare input it receives. Instead, there is one more step, called an activation function; these are the decision-making units of the brain. The activation function takes the same weighted sum input from before, z = ∑ xi·wi + b, and activates (fires) the neuron if the weighted sum is higher than a certain threshold. Later in this chapter we’ll review the different types of activation functions and their general purpose in the broader context of neural networks. The simplest activation function used by the perceptron algorithm is the “step function”, which produces a binary output (0 or 1). It basically says that if the summed input ≥ 0, the neuron “fires” (output = 1); else (summed input < 0) it doesn’t fire (output = 0).
Figure 2.6
ŷ = g(z), where g is the activation function and z is the weighted sum: z = ∑ xi·wi + b
This is how the step function looks in python:

# z is the weighted sum: z = sum(xi * wi) + b
def step_function(z):
    # fire (output 1) when the weighted sum meets the threshold of 0
    if z >= 0:
        return 1
    else:
        return 0
2.1.2 How does the perceptron learn?
The neuron uses trial and error to learn from its mistakes. It uses the weights as knobs, tuning their values up and down until the network is trained. The perceptron’s learning logic goes like this: 1) the neuron calculates the weighted sum and applies its activation function to make a prediction ŷ (feedforward), 2) it compares the prediction with the correct label to calculate the error, and 3) it updates the weights: if the prediction is too high it adjusts the weights to make a lower prediction the next time, and vice versa.
This process is repeated many times, and the neuron continues to update the weights to improve its predictions until step 2 produces a very small error, close to zero, which means that the neuron’s prediction is very close to the correct value. At this point, we can stop the training and save the weight values that yielded the best results, to apply them to future cases where the outcome is unknown and make real predictions.
2.1.3 Is one neuron enough to solve complex problems?
Well, the short answer is no. But let’s see why. The perceptron is a linear function. This means that the trained neuron will produce a straight line that separates our data. Suppose we want to train a perceptron to predict whether a player will be accepted into the college squad. We collect all the data from previous years and train the perceptron to predict whether players will be accepted based on only two features (height and age). The trained perceptron will find the best weight and bias values to produce the straight line that best separates the accepted from the non-accepted (best fit). This line has the equation z = height·w1 + age·w2 + b. After the training is complete, we can start using this perceptron to make predictions for new players. When we get a player who is 150 cm tall and 12 years old, we compute the above equation with the values (150, 12). When plotted on the graph, if the point falls below the line, the neuron predicts that this player will not be accepted. If it falls above the line, the player will be accepted.
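As a quick sketch of this decision step in python (the weight and bias values below are made up for illustration, not trained values):

# toy perceptron decision for the new player; w1, w2, b are hypothetical
def step_function(z):
    return 1 if z >= 0 else 0

w1, w2, b = 0.01, 0.3, -6.0    # assumed trained weights and bias
height, age = 150, 12          # the new player's features

z = height * w1 + age * w2 + b    # weighted sum: 1.5 + 3.6 - 6.0 = -0.9
accepted = step_function(z)       # 1 = accepted, 0 = not accepted
print(z, accepted)                # -0.9 falls below the line: not accepted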
Figure 2.7
In the above example, the single perceptron works fine because our data was linearly separable, which means that the training data can be separated by a straight line. But life isn’t always that simple. What happens when we have a more complex dataset that cannot be separated by a straight line (a non-linear dataset)? As you can see in the figure below, a single straight line will not separate our training data; we say that it does not fit our data. We need a more complex network for more complex data like this. What if we built a network with two perceptrons? This would produce two lines. Would that help us separate the data better?
Figure 2.8
Okay, this is definitely better than the straight line. But I still see some blue and red points mispredicted. Can we add more neurons to make the function fit better? Now you are getting it. Conceptually, the more neurons we add, the better the network will fit our training data. In fact, if we add too many neurons, the network will overfit the training data (not good), but we will talk about this later. The general rule here is: the more complex our network is, the better it learns the features of our data.
We saw that a single perceptron works great with simple datasets that can be separated by a straight line. But, as you can imagine, the real world is much more complex than that. This is where neural networks can show their full potential.
Nonlinear datasets: the data cannot be split with a single straight line. We need more than one line to form a shape that splits the data.
Figure 2.9
Look at the 2D data above. In the linear problem, the star and dot shapes can easily be classified by drawing a single straight line. In nonlinear data, a single line will not separate both shapes.
To split a nonlinear dataset, we will need more than one line. This means that we will need an architecture that uses tens or even hundreds of neurons in our neural network. Let’s look at the example below. We learned that the perceptron is a linear function that produces a straight line, so to fit the data below we try to create a triangle-ish shape that splits the dark dots. It looks like three lines would do the job.
Figure 2.10
The above diagram is an example of a small neural network used to model nonlinear data. In this network, we used three neurons stacked together in one layer called a hidden layer. It is given this name because we don’t see the output of this layer during the training process.
2.2.1 Multi-Layer Perceptron Architecture
We’ve seen how a neural network can be designed to have more than one neuron. Let’s expand on this idea with a more complex dataset. The diagram below is from the Tensorflow Playground website. We try to model a spiral dataset to classify between two classes. In order to fit this dataset, we need to build a neural network that contains tens of neurons. A very common neural network architecture is to stack the neurons in layers on top of each other, called hidden layers. Each layer has n neurons, and layers are connected to each other by weight connections. This leads to the Multi-Layer Perceptron (MLP) architecture below.
Figure 2.11
The main components of the neural network architecture are:
Output layer: we get the answer or prediction from our model from the output layer. Depending on the setup of the neural network, the final output may be a real-valued output (regression problem) or a set of probabilities (classification problem). This is controlled by the type of activation function we use in the neurons in the output layer. We’ll discuss the different types of activation functions in the next section.
We already discussed the input layer, weights, and the output layer. The new thing in this architecture is the hidden layers.
2.2.2 What are the Hidden Layers?
This is where the core of the feature learning process takes place. When you look at the hidden layer nodes in the diagram above, you will see that the early layers detect simple patterns in the training data to learn low-level features (straight lines). Later layers detect patterns within patterns to learn more complex features and shapes, then patterns within patterns within patterns, and so on. This concept will come in handy when we discuss Convolutional Neural Networks (CNNs) in later chapters. For now, know that in neural networks we stack hidden layers to learn increasingly complex features until we fit our data. So when designing your neural network, if your network is not fitting the data, the answer could be adding more hidden layers.
2.2.3 How many layers and how many nodes in each layer?
Hyperparameter Alert!
As a machine learning engineer, most of your work will be designing your network and tuning its hyperparameters. While there is no single prescribed recipe that fits all models, we will try throughout this book to build an intuition about the different hyperparameters and recommend some starting points. Setting the number of layers and the number of neurons in each layer is one of the important hyperparameter decisions you will make when working on neural networks. The network can have one or more hidden layers (technically, as many as you want), and each layer has one or more neurons (also, technically, as many as you want). Your main job, as a machine learning engineer, is to design these layers. Usually when we have two or more hidden layers, we call this a deep neural network.

The general rule is: the deeper your network is, the more it will fit the training data. This is not always a good thing, because the network can fit the training data so closely that it fails to generalize when you show it new data (overfitting); it also gets more computationally expensive. So your job is to build a network that is not too simple (one neuron) and not too complex for your data. It is recommended that you read about different neural network architectures that have been successfully implemented by others, to build an intuition about what suits your problem. Then start from a point, maybe 2 or 3 layers, and observe the network performance. If it is performing poorly (underfitting), add more layers. If you see signs of overfitting (discussed later), decrease the number of layers. More on that later. To build an intuition on how neural networks perform when you add more layers, I advise that you play around with
Tensorflow Playground.
Fully connected layers:
It is important to call out that the layers in classical MLP network architectures are fully connected to the next hidden layer. In the diagram above, notice that each node in a layer is connected to all nodes in the previous layer. This is called a fully connected network. These edges are the weights that represent the importance of each node to the output value.
We discussed the high level process of how the perceptron learns. The learning process is a repetition of three main steps: 1) feedforward calculations to produce a prediction (weighted sum and activation), 2) calculate the error, and 3) backpropagate the error and update the weights to minimize the error. Next, we will dive deeper into each one of these steps:
When you are building your neural network, one of the design decisions you will need to make is which activation function to use for your neurons’ calculations. Activation functions are also referred to as transfer functions or nonlinearities because they transform the linear combination of the weighted sum into a non-linear model. The activation function is placed at the end of each perceptron to decide whether to activate that neuron or not.
The purpose of the activation function is to introduce non-linearity into the network. Without it, a multi-layer perceptron would perform just like a single perceptron no matter how many layers we add. Activation functions are also needed to restrict the output value to a certain finite range. Let’s revisit the example of predicting whether a player gets accepted or not:
Figure 2.13
First, the model calculates the weighted sum and produces the linear function z = height·w1 + age·w2 + b. The output of this function has no bound; (z) could literally be any number. We use an activation function to map the prediction values to a finite range. In this example, we used the step function: if z ≥ 0, the point is above the line (accepted), and if z < 0, it is below the line (rejected). So without the activation function, we just have a linear function that produces a number, but no decision is made by this perceptron; the activation function is the decider of whether to fire the perceptron or not. There is an infinite number of possible activation functions. In fact, the last few years have seen a lot of progress in “state-of-the-art” activations. However, there is still a relatively small list of activations that accounts for the vast majority of activation needs. Let’s dive deeper into some of the most common types of activation functions:
Also called the identity function, this means that the function passes the signal through unchanged. In practical terms, the output is equal to the input, which means we don’t actually have an activation function. So no matter how many layers our neural network has, all it is doing is computing a linear function (or, at most, scaling the incoming weighted average); it doesn’t transform the input into a nonlinear function.

activation(z) = z = wx + b

The composition of two linear functions is a linear function, so unless you include a non-linear activation function in your neural network, you are not computing any interesting functions no matter how deep you make your network. No learning here!
The step function produces a binary output. It basically says that if the input x ≥ 0, it “fires” (output y = 1); else (input < 0) it doesn’t fire (output y = 0). It is mainly used in binary classification problems like true or false, spam or not spam, or pass or fail.
This is one of the most common activation functions. It is commonly used in binary classifiers to predict the probability of a class when you have two classes. The sigmoid squishes all values to a probability between 0 and 1, which reduces extreme values or outliers in the data without removing them. Sigmoid (or logistic) functions convert infinite continuous variables (in the range −∞ to +∞) into simple probabilities between 0 and 1. It is also called the “S”-shape curve because, when plotted on a graph, it produces an S-shaped curve. While the step function is used to produce a discrete answer (pass or fail), sigmoid is used to produce the probability of passing and the probability of failing.
Figure 2.15
Here is how sigmoid is implemented in python:

import numpy as np

# sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
The softmax function is a generalization of the sigmoid function. It is used to obtain classification probabilities when we have more than two classes. It forces the outputs of the neural network to sum to 1 (with each output between 0 and 1). The most common use case in deep learning (especially in computer vision) is to predict a single class out of many options (more than two). For example, if you want to build a digit classifier to predict which number is in an image, you are trying to find the probability of a number in the image out of 10 classes (the numbers 0 to 9, since there can be only one digit in the image).
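To make this concrete, here is a minimal numpy sketch of a softmax function (the shift by the maximum value is a common numerical-stability trick, an implementation detail rather than part of the definition above):

import numpy as np

# softmax: turn raw scores into probabilities that sum to 1
def softmax(z):
    exps = np.exp(z - np.max(z))  # shift by max for numerical stability
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))  # ~[0.659, 0.242, 0.099], sums to 1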
TIP
The softmax function is the go-to function that you will often use at the output layer of a classifier when you are working on a problem where you need to predict one class among more than two classes. Softmax can work fine if you are classifying two classes as well; it will basically work like a sigmoid function. By the end of this section, I’ll tell you my recommendations on when to use each activation function.
Tanh is a shifted version of the sigmoid function. Instead of squeezing the signal values between 0 and 1, tanh squishes all values to the range −1 to 1. Tanh almost always works better than the sigmoid function in the hidden layers because it has the effect of centering your data so that the mean of the data is close to zero rather than 0.5, which makes learning a little bit easier for the next layer. A tanh implementation is sketched below.
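Following the pattern of the other activations in this section, a tanh implementation could simply wrap numpy’s built-in:

import numpy as np

# tanh activation function: squashes values to the range -1 to 1
def tanh(x):
    return np.tanh(x)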
One of the downsides of both the sigmoid and tanh functions is that if (z) is very large or very small, the gradient (or derivative, or slope) of the function becomes very small (close to zero), which slows down gradient descent. This is where the ReLU activation function (explained next) comes in to solve this problem.

The ReLU activation function activates a node only if the input is above zero. If the input is below zero, the output is always zero. But when the input is higher than zero, it has a linear relationship with the output variable: f(x) = max(0, x). At the time this book is being written, ReLU is considered the state of the art among activation functions because it works well in many different situations and it tends to train better than sigmoid and tanh in the hidden layers.
Figure 2.17
Here is how ReLU is implemented in python:

# relu activation function
def relu(x):
    if x < 0:
        return 0
    else:
        return x
One disadvantage of the ReLU activation is that the derivative is equal to zero when (x) is negative. Leaky ReLU is a ReLU variation that tries to mitigate this issue. Instead of the function being zero when x < 0, Leaky ReLU introduces a small negative slope (around 0.01) when (x) is negative. It usually works better than the ReLU function, although it’s just not used as much in practice. Take a look at the Leaky ReLU graph in figure 2.x; can you see the leak?

f(x) = max(0.01x, x)

Why 0.01? Some people like to treat this as another hyperparameter to tune, but that would be overkill since you already have other, bigger problems to worry about. Feel free to try different values (0.1, 0.01, 0.002) in your model and see how they work.
Here is how Leaky ReLU is implemented in python:

# leaky relu activation function with a 0.01 leak
def leaky_relu(x):
    if x < 0:
        return x * 0.01
    else:
        return x
Hyperparameter Alert!
Due to the number of activation functions, it may appear to be an overwhelming task to select the appropriate activation function for your network. While it is important to select a good activation function, I promise this is not going to be a challenging task when you design your network. There are some rules of thumb that you can start with, then you can tune the model as needed. If you are not sure what to use, here are my two cents when choosing an activation function:
Now that we understand how to stack perceptrons in layers, connect them with weight edges, perform the weighted sum function, and apply activation functions, let’s implement the complete forward pass calculations to produce a prediction output.
The process of computing the linear combination and applying the activation function is called feedforward. We briefly saw how the feedforward is calculated several times in the previous sections; let’s take a deeper look at what happens in this process. The term feedforward describes the forward direction in which the information flows: from the input layer, through the hidden layers, all the way to the output layer. This process happens through the implementation of two consecutive functions: 1) the weighted sum, and 2) the activation function. In short, the forward pass is the set of calculations through the layers that makes a prediction. Let’s take a look at this simple three-layer neural network and explore each of its components:
We have all we need to start the feedforward calculations:
Then we do the same calculations for layer 2, all the way to the output prediction in layer 3:
And there you have it! You just calculated the feedforward pass of a three-layer neural network. Let’s take a moment to reflect on what we just did. Look at how many equations we needed to solve for such a small network. What happens when we have a more complex problem with hundreds of nodes in the input layer and hundreds more in the hidden layers? It is more efficient to use matrices to pass through multiple inputs at once. Doing this allows for big computational speedups, especially when using tools like numpy, where we can implement this in one line of code. Let’s see how the matrix computation looks:
Figure 2.20
All we did here is stack the inputs and weights in matrices and multiply them together. The intuitive way to read this equation is from right to left. Start at the far right and follow along:
Here is a simplified representation of this matrix formula:

ŷ = σ(W(3) · σ(W(2) · σ(W(1) · x)))
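As a minimal numpy sketch of this formula (the layer sizes and random weights here are made up for illustration, and biases are omitted to match the simplified formula above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# assumed layer sizes: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of layer 1
W2 = rng.normal(size=(4, 4))   # weights of layer 2
W3 = rng.normal(size=(1, 4))   # weights of layer 3

x = np.array([0.5, 0.2, 0.8])  # one input vector

# feedforward: weighted sum then activation, layer by layer
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
y_hat = sigmoid(W3 @ a2)
print(y_hat)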
2.4.2 Feature learning
The nodes in the hidden layers (ai) are the new features that are learned after each layer. For example, if you look at the diagram from the previous page, you see that we have three feature inputs (x1, x2, and x3). After computing the forward pass in the first layer, the network learns patterns, and these features are transformed into three new features with different values. Then, in the next layer, the network learns patterns within those patterns and produces new features, and so forth. The features produced after each layer are not fully understood; we don’t see them, nor do we have much control over them. It is part of the neural network magic, and it is why these layers are called hidden layers. What we do is look at the final output prediction and keep tuning parameters until we are satisfied with the network’s performance. To reiterate, let’s see this in a small example. Below is a small neural network that estimates the price of a house based on three features: 1) how many bedrooms it has, 2) how big it is, and 3) which neighborhood it is in. You can see that the original input feature values 3, 2000, and 1 were transformed into new feature values after performing the feedforward process in the first layer, then transformed again into a prediction output value (ŷ). When training the neural network, we compare the prediction output with the true price to calculate the error, and repeat until we get the minimum error.
Figure 2.21
To help visualize the feature learning process, let’s take another look at the image we showed earlier (below) from the Tensorflow Playground. You can see that the first layer learns basic features like lines and edges. The second layer starts to learn more complex features like corners. And so on, until the last layers of the network learn even more complex feature shapes, like the circles and spiral shapes that fit the dataset.
Figure 2.22
That is how neural networks learn new features via their hidden layers. First, they recognize patterns in the data. Then, they recognize patterns within patterns. Then patterns within patterns within patterns. And so on. The deeper the network is, the more it learns about the training data.
Up until this point, we have learned how to implement the forward pass in neural networks to produce a prediction, which consists of the weighted sum and activation operations. Now, how do we evaluate the prediction that the network just produced? More importantly, how do we know how far this prediction is from the correct answer (the label)? The answer: measure the error. Selecting the error function is another important aspect of the design of a neural network. Error functions are also referred to as cost functions or loss functions, and the terms are used interchangeably in the deep learning literature.
The error function is a measure of how “wrong” the neural network’s prediction is with respect to the expected output (the label). It quantifies how far we are from the correct solution. For example, a high loss value means our model is not doing a good job. The smaller the loss, the better a job the model is doing. The larger the loss, the more our model needs to be trained to increase its accuracy.
Calculating an error turns this into an optimization problem, which is something all machine learning engineers love (mathematicians too). Optimization problems focus on defining an error function and trying to optimize its parameters to get the minimum error (more on optimization in the next section). For now, it is good to know that, in general, if we are able to define an error function for a problem, we have a very good shot at solving it by optimizing that error. In optimization problems, our ultimate goal is to find the optimal variables (weights) that minimize the error function as much as we can. After all, if we don’t know how far we are off from the target, how would we know what to change in the next iteration? The process of minimizing this error is called error optimization. There are several optimization methods, which we are going to review in the next section. But for now, all we need from the error function is a measure of how far we are from the correct prediction, or “how much we missed”.
Consider this scenario: suppose we have two data points (two input–goal_prediction pairs) that we are trying to get our network to predict correctly. If the first gives an error of 10 and the second gives an error of −10, then our average error would be ZERO! That is misleading, because error = 0 means that our network is producing perfect predictions when in fact it missed by 10 each time. We don’t want that. So we want the error of each prediction to always be positive, so that the errors don’t cancel each other out when we take the average. Think of an archer who misses the target by 1 inch. We are not really concerned with the direction of the miss; all we need to know is how far each shot is from the target.
A visualization of the loss of two separate models plotted over time is shown in figure 2.x. You can see that model 1 is doing a better job of minimizing the error, whereas model 2 started out better until epoch 6 and then plateaued.
Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. A thorough discussion of loss functions is outside the scope of this book. Instead, we will focus on the two most commonly used loss functions: 1) Mean Squared Error (and its variations) usually used for regression problems, and 2) Cross Entropy used for classification problems.
2.5.4 Mean Square Error (MSE)
MSE is commonly used in regression problems that require the output to be a real value (like house pricing). Instead of just comparing the prediction output with the label (ŷi − yi), the error is squared and averaged over the number of data points, as in the following equation:

E(W, b) = (1/N) ∑ i=1..N (ŷi − yi)²

The MSE is a good choice for a few reasons. The square ensures the error is always positive, and larger errors are penalized more than smaller errors. Also, it makes the math nice, which is always a plus. The notation in this formula is listed in the table below:
Table 2.1

Notation      Meaning
E(W, b)       The loss function. Also annotated as J(W, b) in other literature
W             Weights matrix. In some literature, the weights are denoted by theta (θ)
b             Biases vector
N             Number of training examples
ŷi            Prediction output. Also notated as hw,b(X) in some deep learning literature
yi            The correct output (the label)
(ŷi − yi)     Usually called the residual
MSE sensitivity to outliers

MSE is quite sensitive to outliers, since it squares the error value. This may or may not be a problem for the specific task you are solving; in fact, this sensitivity to outliers can be beneficial in some cases. For example, if you are predicting a stock price, you would want to take outliers into account, so sensitivity to outliers is a good thing there. In other scenarios, you wouldn’t want to build a model that is skewed by outliers, such as predicting house prices in a city, where we are more interested in the median and less in the mean. A variation of MSE called Mean Absolute Error (MAE) was developed for just this purpose. It averages the absolute error over the entire dataset without squaring the error.
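As a minimal sketch of both losses in numpy (assuming y_hat and y are arrays of predictions and labels):

import numpy as np

def mse(y_hat, y):
    # mean squared error: squaring keeps errors positive and penalizes large misses more
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    # mean absolute error: keeps errors positive without amplifying outliers
    return np.mean(np.abs(y_hat - y))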
2.5.5 Cross Entropy
Cross entropy is commonly used in classification problems because it quantifies the difference between two probability distributions. For example, suppose that for a specific training instance, we are trying to classify a dog image out of three possible classes (dog, cat, fish). The true distribution for this training instance is:

P(cat)  P(dog)  P(fish)
0.0     1.0     0.0

You can interpret the above “true” distribution to mean that the training instance has 0% probability of being a cat, 100% probability of being a dog, and 0% probability of being a fish. Now, suppose your machine learning algorithm predicts the following probability distribution:

P(cat)  P(dog)  P(fish)
0.2     0.3     0.5
How close is the predicted distribution to the true distribution? That is what the cross-entropy loss measures, using this formula:

E = − ∑ i=1..m yi · log(pi)

where (y) is the target probability, (p) is the predicted probability, and (m) is the number of classes. The sum is over the three classes: cat, dog, and fish. In this case, the loss is 1.2:

E = − (0.0*log(0.2) + 1.0*log(0.3) + 0.0*log(0.5)) = 1.2
So that is how “wrong” or “far away” your prediction is from the true distribution. Let’s do this one more time, just to build some intuition for how the loss changes when the network makes better predictions. In the above example, we showed the network an image of a dog, and it predicted that it is a dog with only 30% probability, which is very far from the target. In later iterations, the network learns some patterns and improves its prediction up to 50%:

P(cat)  P(dog)  P(fish)
0.3     0.5     0.2

Then, we calculate the loss again:

E = − (0.0*log(0.3) + 1.0*log(0.5) + 0.0*log(0.2)) = 0.69

You can see how, when the network made a better prediction (dog up to 50% from 30%), the loss decreased from 1.2 to 0.69. In the ideal case, when the network predicts that the image is 100% a dog, the cross entropy loss is zero (feel free to try the math). To calculate the cross entropy error across all the training examples (n), we sum the per-example cross entropy over all n examples, using this general formula:

E = − ∑ j=1..n ∑ i=1..m yij · log(pij)
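As a quick sketch in numpy, reproducing the two losses computed above (note that the logs here are natural logs, matching the worked numbers):

import numpy as np

def cross_entropy(y_true, y_pred):
    # y_true: one-hot target distribution, y_pred: predicted probabilities
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])                        # the image is a dog
print(cross_entropy(y_true, np.array([0.2, 0.3, 0.5])))   # ~1.20
print(cross_entropy(y_true, np.array([0.3, 0.5, 0.2])))   # ~0.69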
It is important to call out that you will not be doing these calculations by hand. Understanding how things work under the hood gives you a better intuition when you are designing your neural network. In deep learning projects, we usually use libraries like Tensorflow, PyTorch, or Keras, where the error function is usually a parameter choice.
2.5.6 A final note on errors and weights
As we mentioned before, in order for the neural network to learn, it needs to minimize the error function as much as it can (0 is ideal). The lower the error gets, the more accurate the model will be at predicting values. Now, how do we minimize the error? Let’s look at the perceptron example below, with a single input, to understand the relationship between the weight and the error:
Figure 2.29
Suppose the input is x = 0.3 and its label (goal prediction) is y = 0.8. The prediction output (ŷ) of this perceptron is then calculated as follows:

ŷ = w · x = w · 0.3

And the error, in its simplest form, is calculated by comparing the prediction ŷ with the label y:

error = |ŷ − y| = |(w · x) − y| = |w · 0.3 − 0.8|
If you look at the error function above, you will notice that the input value x and the goal prediction y are fixed; they will never change for this specific data point. The only variable we can change in this equation is the weight, which in turn changes the error. Now, if we want to get to the minimum error, which variable can we play with? Correct, the weight! The weight acts as a knob that the network adjusts up and down until it gets the minimum error. This is how the network learns: by adjusting weights. When we plot the error function with respect to the weight, we get the following graph:
Figure 2.30
As we mentioned before, we initialize the network with random weights. The initial weight lies somewhere on this curve, and our mission is to make it descend the curve to its optimal value with the minimum error. The process of finding the goal weights of the neural network happens by adjusting the weight values in an iterative process using an optimization algorithm.
Training a neural network means showing the network many examples (the training dataset); the network makes predictions through feedforward calculations and compares them with the correct labels to calculate the error. Finally, the neural network needs to adjust the weights (on all edges) until it gets the minimum error value, which means maximum accuracy. All we need now is to build algorithms that can find these optimum weights for us.
Ahh, optimization! A topic that is dear to my heart, and to every machine learning engineer (mathematicians too). Optimization is a way of framing a problem in order to maximize or minimize some value. The best thing about computing an error function is that it turns the neural network into an optimization problem, where our goal is to minimize the error. An optimization example: suppose you want to optimize your commute from home to work. First, you need to define the metric that you are optimizing, the “error function”. Maybe you want to optimize the price of the commute, or the time, or the distance. Then, based on that specific loss function, you work on minimizing its value by changing some parameters. Changing the parameters to minimize (or maximize) a value is called optimization. If you choose the loss function to be the price, maybe you will take a longer commute of 2 hours, or maybe (hypothetically) walk for 5 hours, to minimize the price. On the other hand, if you want to optimize the time spent in the commute, maybe you will spend $50 on a cab that cuts the commute time to 20 minutes. So, based on the loss function you defined, you can start changing your parameters to get the results you want.
Let’s look at the space that we are trying to optimize:
Figure 2.31
In the simplest form, a perceptron with one input, we have only one weight in our network. We can then easily plot the error (which we are trying to minimize) with respect to this weight, represented by this 2D curve:
Figure 2.32
Okay, what if we have 2 weights? If we were able to graph all the possible values of these 2 weights, we would get a 3D plane of the error. More than 2 weights? Your network will most probably have hundreds or thousands of weights (because each edge in your network has its own weight value). Since we humans can only visualize up to 3 dimensions, it is impossible for us to visualize error graphs when we have 10 weights, not to mention hundreds or thousands. So, from this point on, we will study the error function using 2D or 3D error planes. In order to optimize the model, our goal is to search this space to find the best weights that achieve the lowest possible error.
Figure 2.33
Why do we need an optimization algorithm? Can’t we just brute force through a lot of weight values until we get the minimum error? One possible approach (only theoretically) is a brute force approach: just try out a lot of different possible weights (say 1,000 values per weight) and find the combination that produces the minimum error. Would that work? Well, theoretically, yes. This approach might work when we have very few inputs and one or two neurons in our network. Let me try to convince you that it wouldn’t scale. Let’s take a look at a scenario with a very simple neural network: suppose we want to predict house prices based on only four features (inputs) and just one hidden layer of 5 neurons:
Figure 2.34
As you can see, we have 20 edges (weights) from the input to the hidden layer, plus 5 weights from the hidden layer to the output prediction, for a total of 25 weight variables that need to be adjusted to optimal values. To brute force our way through a neural network of this size, trying 1,000 different values for each weight, we would have a total of 1,000^25 = 10^75 combinations. Let’s say we were able to get our hands on the fastest supercomputer in the world, Sunway TaihuLight, which operates at a speed of 93 petaFLOPS (floating-point operations per second), i.e. 93 × 10^15 FLOPS. In the best-case scenario, this supercomputer would need about 10^75 / (93 × 10^15) ≈ 10^58 seconds, on the order of 10^50 years.
That is far longer than the universe has existed. Who has that kind of time to wait for the network to train? Remember that this is a very simple neural network that usually takes a few minutes to train using smarter optimization algorithms. In the real world, you will be building more complex networks that have thousands of inputs and tens of hidden layers, and you will be required to train them in a matter of hours (or days, and sometimes weeks). So we have to come up with a different approach to find the optimal weights. Hopefully, I have convinced you that brute forcing through the optimization process is not the answer. Now, let’s study the most popular optimization algorithm for neural networks: gradient descent. Gradient descent has several variations: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MB-GD).
The general definition of the gradient is that it is the function that tells you the slope, or rate of change, of the line tangent to the curve at any given point. It is also known as the derivative. “Gradient” is just a fancy term for the slope or steepness of the curve.
Figure 2.35
What is gradient descent?
Gradient descent simply means updating the weights iteratively to descend the slope of the error curve until we reach the point with minimum error. Let’s take a look at the error function, introduced earlier, with respect to the weights. At the initial weight point, we calculate the derivative of the error function to get the slope (direction) of the next step. We keep taking steps down the curve until we reach the minimum error.
Figure 2.36
How does gradient descent work?
To visualize how gradient descent works, let’s plot the error function in a 3D graph and go through the process step by step. The random initial weight (starting weight) is at point A, and our goal is to descend this error mountain to the goal w1 and w2 weight values, which produce the minimum value of the error. The way we do that is by taking a series of steps down the curve until we reach the minimum error. In order to descend the error mountain, we need to determine two things for each step:

The step direction (gradient)

The step size (learning rate)
Figure 2.37
1) The direction (gradient)
Suppose you are standing on top of the error mountain at point A. To get to the bottom, you need to determine the step direction that will make you descend the most (i.e., that has the steepest slope). And what is the slope again? It is the derivative of the curve. So, standing on top of that mountain, you look at all the directions around you and find the one that makes you descend the most (1, 2, 3, or 4, for example). Let’s say it is direction 3. We step to point B and restart the process (calculate the feedforward and the error), find the direction of steepest descent, and so forth, until we reach the bottom of the mountain. This process is called gradient descent. By taking the derivative of the error with respect to the weight (dE/dw), we get the direction of the step. One thing is left: the gradient only determines the direction. How big should the step be? It could be a 1-foot step or a 100-foot jump. This is what we determine next.
2) The step size (learning rate α)
The learning rate is the size of each step the network takes when descending the error mountain, and it is usually denoted by the Greek letter alpha (α). It is one of the most important hyperparameters that you will tune when training your neural network (more on that later). A larger learning rate means the network learns faster (since it descends the mountain with larger steps), and smaller steps mean slower training. Well, that sounds simple enough: let’s use large learning rates and complete the neural network training in minutes instead of waiting for hours, right? Not quite. Let’s take a look at what can happen when we set a very large learning rate.
Figure 2.38
In figure 2.x, you start at point A. When you take a large step in the arrow’s direction, instead of descending the error mountain, you end up at point B on the other side; then another large step takes you to C, and so forth. The error keeps oscillating and never descends. We will talk more about tuning the learning rate and how to detect that the error is oscillating. For now, you need to know this: with a very small learning rate, the network will eventually descend the mountain and reach the minimum error, but training will take a long time (maybe weeks or months). With a very large learning rate, the network might keep oscillating and never train. So we usually initialize the learning rate to 0.1 or 0.01, see how the network performs, and then tune it further.
Putting direction and step together
By multiplying the direction (derivative) by the step size (learning rate), we get the change of the weight for each step:

Δw = −α · (dE/dw)

We add the minus sign because the derivative calculates the slope in the upward direction. Since we need to descend the mountain, we go in the opposite direction of the slope.
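To make this concrete, here is a toy gradient descent loop for the single-weight example from earlier (x = 0.3, target y = 0.8), using a squared error and an assumed learning rate of 0.1:

# toy gradient descent for one weight: x = 0.3, target y = 0.8
# squared error E = (w*x - y)^2, so dE/dw = 2*(w*x - y)*x
x, y = 0.3, 0.8
w = 0.1        # random initial weight
alpha = 0.1    # learning rate (step size)

for step in range(1000):
    y_hat = w * x                  # feedforward
    dE_dw = 2 * (y_hat - y) * x    # slope of the error curve at w
    w = w - alpha * dE_dw          # step in the opposite direction of the slope

print(w, w * x)  # w converges toward 0.8 / 0.3 ≈ 2.67, so the prediction approaches 0.8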
Calculus refresher
Calculate the Partial Derivative
The derivative is the study of change. It measures the steepness of a curve at some particular point on that graph.
This means that if your training set (N) has 100,000,000 (100 million) records, the algorithm needs to sum over 100 million records just to take one step, which is computationally very expensive and slow to train. This is why the algorithm is also called batch gradient descent: it uses the entire training data in one batch. One possible approach to solving these problems is stochastic gradient descent (SGD), in which the algorithm randomly selects data points and goes through the gradient descent one data point at a time. This provides many different weight starting points and descends all the mountains to calculate their local minima. The minimum value of all these local minima is then the global minimum. Sounds very intuitive. That is the concept behind the stochastic gradient descent algorithm.
2.6.3 Stochastic Gradient Descent (SGD)
Stochastic is just a fancy word for random. Stochastic gradient descent (SGD) is probably the most used optimization algorithm for machine learning in general and for deep learning in particular. While gradient descent measures the loss and gradient over the full training set to take one step towards the minimum, stochastic gradient descent randomly picks one instance from the training set for each step and calculates the gradient based only on that single instance. Now, let’s take a look at the pseudocode of both GD and SGD to get a better understanding of the differences between the two algorithms:
Table 2.1

GD:
1) Take ALL the data
2) Compute the gradient
3) Update the weights and take a step down
4) Repeat for n epochs (iterations)
(Top view of the error mountain)

Stochastic GD:
1) Randomly shuffle the samples in the training set
2) Pick one data instance
3) Compute the gradient
4) Update the weights and take a step down
5) Pick another data instance
6) Repeat for n epochs (training iterations)
(Top view of the error mountain)
Because in batch GD we take a step only after computing the gradient for the entire training data, the path down the error surface is smooth and almost a straight line. Due to the stochastic (random) nature of SGD, however, the path towards the global minimum is not as direct as in BGD, but may go “zig-zag” if we visualize the cost surface in 2D space (figure 2.x). That is because each SGD iteration tries to fit just a single training example better, which makes it a lot faster but does not guarantee that every step takes us down the curve. This is fine, because SGD ends up very close to the global minimum, and once it gets there it continues to bounce around without ever settling down. In practice this isn’t a problem, because very close to the global minimum is good enough for most practical purposes. Generally, SGD almost always performs better and faster than batch GD.
2.6.4 Mini-batch Gradient Descent (MB-GD)

Mini-batch gradient descent (MB-GD) is a compromise between BGD and SGD. Instead of computing the gradient from one sample (SGD) or from all training samples (BGD), we divide the training set into mini-batches and compute the gradient from each one (a common mini-batch size is k = 256). MB-GD converges in fewer iterations than BGD because we update the weights more frequently; at the same time, MB-GD lets us use vectorized operations, which typically results in a computational performance gain over SGD.
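Here is a minimal sketch of a mini-batch GD loop, assuming a grad_fn(X_batch, y_batch, w) helper that computes the gradient of the loss for a batch (the names and defaults are illustrative, not from a specific library):

import numpy as np

# minimal mini-batch gradient descent sketch; grad_fn is an assumed helper
def minibatch_gd(X, y, w, grad_fn, alpha=0.01, batch_size=256, epochs=10):
    n = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]   # indices of one mini-batch
            g = grad_fn(X[batch], y[batch], w)      # gradient on the mini-batch
            w = w - alpha * g                       # step down the error surface
    return w

Note that setting batch_size=1 turns this loop into SGD, and batch_size=n into batch GD.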
2.6.5 Gradient descent takeaways
There is kind of a lot going on here, so let’s just sum it up for ourselves, shall we? Here is how gradient descent is summarized in my head:
Finally, you need to know that there have been a lot of variations to gradient descent that have been used over the years. And this is a very active research area. Some of the most popular enhancements are:
Nesterov accelerated gradient
RMSprop
Adam
Adagrad
But don’t worry about these optimizers for now. In chapter 4, we will discuss tuning techniques for choosing and improving your optimizer’s learning in more detail. I know that was a lot, but stay with me. These are the main things I want you to remember from this section:
If you have this covered, you are good to move on to the next section. And don’t worry too much about hyperparameter tuning; I’ll be covering network tuning in more detail in the next chapters and in almost all the projects in this book.
2.7 Backpropagation
Backpropagation is the core of how neural networks learn. Up until this point, we have learned that training a neural network typically happens by the repetition of the following three steps:

1) Feedforward: make a prediction through the weighted sum and activation calculations
2) Calculate the error: compare the prediction with the label
3) a - Use a gradient descent optimization algorithm to compute the Δw that optimizes the error function
   b - Backpropagate the delta_weight through the network to update the weights
In this section, we will dive deeper into step 3-b, backpropagation.
2.7.1 What is backpropagation?
Backpropagation, or the backward pass, means propagating derivatives of the error with respect to each specific weight, dE/dwi, from the last layer (output) back to the first layer (inputs) to adjust the weights. By propagating the delta_weight backwards from the prediction node (y_hat), through the hidden layers, all the way back to the input layer, the weights get updated (wnext-step = wcurrent + Δw), which takes the error one step down the error mountain. Then the cycle starts again (steps 1 to 3), updating the weights to take the error another step down, until we reach the minimum error. This backward pass process is called backpropagation. Backpropagation might sound clear when we have only one weight: we simply adjust the weight by adding Δw (wnew = w − α·dE/dw). But it gets complicated when we have a multi-layer perceptron (MLP) network with many weight variables. To make this clearer, consider this scenario:
Figure 2.45
How do we compute the change of the total error with respect to w13 (dE/dw13)? Remember that dE/dw13 basically asks: how much would the total error change when we change the parameter w13? We learned how to compute dE/dw21 by applying the derivative rules to the error function. That is straightforward, because w21 is directly connected to the error function. But to compute the derivatives of the total error with respect to weights all the way back at the input, we need a calculus rule called the chain rule.
Let’s see how backpropagation uses the chain rule to flow the gradients in the backward direction through the network:
Figure 2.45b
Okay, let’s apply the chain rule to calculate the derivative of the error with respect to the third weight on the first input, w1,3(1), where the (1) means layer 1 and w1,3 means node number 1 and weight number 3. The equation might look complex at first, but all we are really doing is multiplying the partial derivatives along the edges, starting from the output node all the way backward to the input node. The notation is what makes this look complex; once you understand how to read w1,3(1), the backward pass equation really looks like this:
There you have it. That is the backpropagation technique that neural networks use to update the weights to best fit our problem. Let’s take a quick look at how this is implemented in code.
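Here is a toy sketch of the idea, not the general algorithm: a chain of three single-weight nodes (x → a1 → a2 → ŷ) with sigmoid activations and a squared error, where the chain rule multiplies the local derivatives edge by edge; all input, target, and weight values are made up:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 0.5, 1.0                 # one input and its target label
w1, w2, w3 = 0.4, -0.2, 0.1     # assumed initial weights
alpha = 0.1                     # learning rate

for step in range(1000):
    # forward pass: weighted sum then activation, node by node
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    y_hat = sigmoid(w3 * a2)
    # backward pass: chain rule, multiplying local derivatives backwards
    delta3 = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # dE/dz3
    dE_dw3 = delta3 * a2
    delta2 = delta3 * w3 * a2 * (1 - a2)             # dE/dz2
    dE_dw2 = delta2 * a1
    delta1 = delta2 * w2 * a1 * (1 - a1)             # dE/dz1
    dE_dw1 = delta1 * x
    # gradient descent step on each weight
    w1 -= alpha * dE_dw1
    w2 -= alpha * dE_dw2
    w3 -= alpha * dE_dw3

print(y_hat)  # the prediction moves toward the target y = 1.0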
2.7.2 Backpropagation takeaways
Backpropagation is a learning procedure for neurons.