1 The mechanics of learning


This chapter covers

  • Using Google Colab for coding
  • Introducing PyTorch, a tensor-based API for deep learning
  • Running faster code with PyTorch’s GPU acceleration
  • Understanding automatic differentiation as the basis of learning
  • Using the Dataset interface to prepare data

Deep learning, also called neural networks or artificial neural networks, has led to dramatic advances in machine learning quality, accuracy, and usability. Technology that was considered impossible 10 years ago is now widely deployed or considered technically possible. Digital assistants like Cortana, Google, Alexa, and Siri are ubiquitous and can react to natural spoken language. Self-driving cars have been racking up millions of miles on the road as they are refined for eventual deployment. We can finally catalog and calculate just how much of the internet is made of cat photos. Deep learning has been instrumental to the success of all these use cases and many more.

This book exposes you to some of the most common and useful techniques in deep learning today. A significant focus is how to use and code these networks and how and why they work at a deep level. With a deeper understanding, you’ll be better equipped to select the best approach for your problems and keep up with advances in this rapidly progressing field. To make the best use of this book, you should be familiar with programming in Python and have some passing memory of calculus, statistics, and linear algebra courses. You should also have prior experience with machine learning (ML), although it is OK if you aren’t an expert; ML topics are quickly introduced, but our goal is to move into details about deep learning.

Let’s get a clearer idea of what deep learning is and how this book teaches about it. Deep learning is a subdomain of ML, which is a subdomain of artificial intelligence (AI). (Some may take offense at how I’m categorizing these groups. It’s an oversimplification.) Broadly, we could describe AI as getting computers to make decisions that look smart. I say look because it is hard to define what smart or intelligence truly is; AI should be making decisions that we think are reasonable and what a smart person would do. Your GPS telling you how to get home uses some old AI techniques to work (these classic tried-and-true methods are sometimes called “good old-fashioned AI,” or GOFAI), and taking the fastest route home is a smart decision. Getting computers to play video games has been accomplished with purely AI-based approaches: only the rules of the game are encoded; the AI does not need to be shown how to play a game of chess. Figure 1.1 shows AI as the outermost layer of these fields.

Figure 1.1 A (simplified) hierarchy of AI, ML, and deep learning

With ML, we start to give AI examples of previous smart and not-so-smart decisions. For example, we can improve our chess-playing AI by giving it example games played by chess grandmasters to learn from (each game has a winner and a loser—a smart and a not-as-smart set of decisions). This is a supervised-centric definition, but the critical component is that we have data that reflects the real world.

Note

A common saying is that data is truth, but that’s also an oversimplification. Many biases can impact the data you receive, giving you a biased view of the world. That is an advanced topic for another book!

Deep learning, in turn, is not one algorithm, but hundreds of small algorithms that act like building blocks. Part of being a good practitioner is knowing what building blocks are available and which ones to stick together to create a larger model for your problem. Each building block is designed to work well for certain problems, giving the model valuable information. Figure 1.2 shows how we might combine blocks together to tackle three situations. One of the goals in this book is to cover a wide variety of building blocks so that you know and understand how they can be used for different kinds of problems. Some of the blocks are generic (“Data is a sequence” could be used for literally any kind of sequence), while others are more specific (“Data is an image” applies to only images), which impacts when and how you use them.

Figure 1.2 A defining characteristic of deep learning is building models from reusable blocks. Different blocks are useful for different kinds of data and can be mixed and matched to deal with different problems. The first row shows how blocks of the same type can be repeated to make a deeper model, which can improve accuracy.

The first row uses two “Data is an image” blocks to create a deep model. Applying blocks repeatedly is where the deep in deep learning comes from. Adding depth makes a model capable of solving more complex problems. This depth is often obtained by stacking the same kind of block multiple times. The second row in the figure shows a case for a sequence problem: for example, text can be represented as a sequence of words. But not all words are meaningful, and so we may want to give the model a block that helps it learn to ignore certain words. The third row shows how to describe new problems using the blocks we know about. If we want our AI to watch a video and predict what is happening (e.g., “running,” “tennis,” or “adorable puppy attack”) we can use the “Data is an image” and “Data is a sequence” blocks to create a sequence of images—a video.

These building blocks define our model, but as in all ML, we also need data and a mechanism for learning. When we say learning, we are not talking about the way humans learn. In machine (and deep) learning, learning is the mechanical process of getting the model to make smart-looking predictions about data. This happens via a process called optimization or function minimization. Before we see any data, our model returns random outputs because all of the parameters (the numbers that control what is computed) are initialized to random values. In a common tool like linear regression, the regression coefficients are the parameters. By optimizing the blocks over the data, we make our models learn. This gives us the larger picture in figure 1.3.

Figure 1.3 The “car” of deep learning. The car is built from many different building blocks, and we can use assortments of building blocks to build cars for different tasks. But we need fuel and wheels to make the car go. The wheels are the task of learning, which is done via a process called optimization; and the fuel is data.

In most chapters of this book, you learn about new building blocks that you can use to build deep learning models for different applications. You can think of each block as a kind of (potentially very simple) algorithm. We discuss the uses of each block and explain how or why they work and how to combine them in code to create a new model. Thanks to the nature of building blocks, we can ramp up from simple tasks (e.g., simple prediction problems you could tackle with a non-deep ML algorithm) to complex examples like machine translation (e.g., having a computer translate from English to French). We start with basic approaches and methods that have been used to train and build neural networks since the 1960s, but using a modern framework. As we progress through the book, we build on what we’ve learned, introducing new blocks, extending old blocks, or building new blocks from existing ones.

That said, this book is not a cookbook of code snippets to throw at any new problem. The goal is to get you comfortable with the language that deep learning researchers use to describe new and improved blocks so you can recognize when a new block may be useful. Math can often express complex changes succinctly, so I will be sharing the math behind the building blocks.

We won’t do a lot of math—that is, derive or prove the math. Instead, I show the math: present the final equations, explain what they do, and attach helpful intuition to them. I’m calling it intuition because we go through the bare minimum math needed. Explaining the high-level idea of what is happening and showing why a result is the way it is would require more math than I’m asking you to have. As I show the equations, I interweave corresponding PyTorch code whenever possible so you can start to build a mental map between the equations and the deep learning code that implements them.

This chapter first introduces our compute environment: Google Colab. Next, we talk about PyTorch and tensors, which is how we represent information in PyTorch. After that, we dive into the use of graphics processing units (GPUs), which make PyTorch fast, and automatic differentiation, which is the “mechanics” that PyTorch uses to make neural network models learn. Finally, we quickly implement a dataset object that PyTorch needs to feed our data into the model for the learning process. This gives us the fuel and wheels to get our deep learning car moving, starting in chapter 2. From there on, we can focus on just the deep learning.

This book is designed to be read linearly. Each chapter uses skills or concepts developed in the preceding chapters. If you are already familiar with the concepts in a chapter, feel free to jump ahead to the next one. But if deep learning is new to you, I encourage you to proceed one chapter at a time instead of jumping to one that sounds more interesting, as these can be challenging concepts, and growing one step at a time will make it easier overall.


1.1 Getting started with Colab

We will use GPUs for everything we do with deep learning. It is, unfortunately, a computationally demanding practice, and GPUs are essentially a requirement for getting started, especially when you start to work on larger applications. I use deep learning all the time as part of my work and regularly kick off jobs that take a few days to train on multiple GPUs. Some of my research experiments can take a month of compute for each run.

Unfortunately, GPUs are expensive. Currently, the best option for most people who want to get started with deep learning is to spend $600–$1,200 on a higher-end NVIDIA GTX or Titan GPU. That is, if you have a computer that can be expanded/upgraded with a high-end GPU. If not, you are probably looking at at least $1,500–$2,500 to build a nice workstation to put those GPUs in. That’s a steep cost just to learn about deep learning.

Google’s Colab (https://colab.research.google.com) provides a GPU for free for a limited amount of time. I’ve designed every example in this book to run within Colab’s time limits. The appendix contains the instructions for setting up Colab. Once you have it set up, the common data science and ML tools like seaborn, matplotlib, tqdm, and pandas are all built-in and ready to go. Colab operates like a familiar Jupyter notebook, where you run code in cells that produce output directly below. This book is a Jupyter notebook, so you can run the code blocks (like the next one) to get the same results (I’ll let you know if a code cell isn’t meant to be run):

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from tqdm.autonotebook import tqdm
import pandas as pd

As we progress through this book, I do not repeatedly show all the imports, as that would be mostly a waste of paper. Instead, they are available online as part of the downloadable copy of the code, which can be found at https://github.com/EdwardRaff/Inside-Deep-Learning.


1.2 The world as tensors

Deep learning has been used on spreadsheets, audio, images, and text, but deep learning frameworks don’t use classes or objects to distinguish between kinds of data. Instead, they work with one data type, and we must convert our data into this format. For PyTorch, this singular view of the world is through a tensor object. Tensors are used to represent both data (the inputs/outputs to any deep learning block) and the parameters that control the behavior of our networks. Two essential features are built into tensor objects: the ability to do fast parallel computation with GPUs and the ability to do some calculus (derivatives) automatically. If you have prior ML experience in Python, you have likely used NumPy, which is built around the same tensor concept. In this section, we quickly review the tensor concept and note how tensors in PyTorch differ from NumPy and form the foundation for our deep learning building blocks.

We begin by importing the torch library and discussing tensors, which are also called n-dimensional arrays. Both NumPy and PyTorch allow us to create n-dimensional arrays. A zero-dimensional array is called a scalar and is any single number (e.g., 3.4123). A one-dimensional array is a vector (e.g., [1.9, 2.6, 3.1, 4.0, 5.5]), and a two-dimensional array is a matrix. Scalars, vectors, and matrices are all tensors. In fact, any value of n for an n-dimensional array is still a tensor. The word tensor refers to the overall concept of an n-dimensional array.

We care about tensors because they are a convenient way to organize much of our data and algorithms. This is the first foundation that PyTorch provides, and we often convert NumPy tensors to PyTorch tensors. Figure 1.4 shows four tensors, their shapes, and the mathematical way to express the shape. Extending the pattern, a four-dimensional tensor’s shape could be written as (B, C, W, H).

Figure 1.4 Examples of tensors, with more dimensions or axes as we move from left to right. A scalar represents a single value. A vector is a list of values and is how we often think about one datapoint. A matrix is a grid of values and is often used for a dataset. A three-dimensional tensor can be used to represent a dataset of sequences.

We use common notation to associate math symbols with tensors of a specific shape. A capital letter like X or Q represents a tensor with two or more dimensions. If we are talking about a vector, we use a lowercase bold letter like x or h. Last, we use a lowercase non-bold letter like x or h for a scalar.

In talking about and implementing neural networks, we often refer to a row in a larger matrix or a scalar in a larger vector. This is shown in figure 1.5 and is often called slicing. So if we have a matrix X, we can use xᵢ to reference the ith row of X. In code, that is x_i = X[i,:]. If we want the value in the ith row and jth column, it becomes xᵢ,ⱼ, which is not bold because it is a reference to a single value—making it a scalar. The code version is x_ij = X[i,j].
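
Here is a minimal sketch of that slicing notation in code (my own example, not one of the book’s listings). It uses NumPy since we have not imported PyTorch yet; PyTorch tensors support the same indexing:

import numpy as np

X = np.arange(12).reshape(3, 4)    # A made-up 3x4 matrix
x_i = X[1, :]                      # The row at index 1 (a vector)
x_ij = X[1, 2]                     # The value at row 1, column 2 (a scalar)
print(x_i, x_ij)

[4 5 6 7] 6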

Figure 1.5 A tensor can be sliced to grab sub-tensors from a larger one. For example, in red, we grab a row-vector from the larger matrix; and in blue, we grab a column-vector from the matrix. Depending on what the tensor represents, this can let us manipulate different parts of the data.

To use PyTorch, we need to import it as the torch package. With it, we can immediately start creating tensors. Every time you nest a list within another list, you create a new dimension of the tensor that PyTorch will produce:

import torch

torch_scalar = torch.tensor(3.14) 
torch_vector = torch.tensor([1, 2, 3, 4]) 
torch_matrix = torch.tensor([[1, 2,], 
                             [3, 4,], 
                             [5, 6,], 
                             [7, 8,]])     #1 
torch_tensor3d = torch.tensor([ 
                             [ 
                             [ 1, 2, 3], 
                             [ 4, 5, 6], 
                             ], 
                             [ 
                             [ 7, 8, 9], 
                             [10, 11, 12], 
                             ], 
                             [ 
                             [13, 14, 15], 
                             [16, 17, 18], 
                             ], 
                             [
                             [19, 20, 21],
                             [22, 23, 24],
                             ] 
                               ])

If we print the shapes of these tensors, you should see the same shapes shown earlier. Again, while scalars, vectors, and matrices are different things, they are unified under the larger umbrella of tensors. We care about this because we use tensors of different shapes to represent different types of data. We get to those details later; for now, we focus on the mechanics PyTorch provides to work with tensors:

print(torch_scalar.shape) 
print(torch_vector.shape) 
print(torch_matrix.shape) 
print(torch_tensor3d.shape)

torch.Size([]) 
torch.Size([4]) 
torch.Size([4, 2]) 
torch.Size([4, 2, 3])

If you have done any ML or scientific computing in Python, you have probably used the NumPy library. As you would expect, PyTorch supports converting NumPy objects into their PyTorch counterparts. Since both of them represent data as tensors, this is a painless process. The following two code blocks show how we can create a random matrix in NumPy and then convert it into a PyTorch Tensor object:

x_np = np.random.random((4,4)) 
print(x_np) 
[[0.05095622 0.64330091 0.98293797 0.27355789]
[0.37754388 0.51127555 0.29976254 0.97804978]
[0.28363853 0.48929802 0.77875258 0.19889717]
[0.23659932 0.21207824 0.25225453 0.54866766]]

x_pt = torch.tensor(x_np) 
print(x_pt) 
tensor([[0.0510, 0.6433, 0.9829, 0.2736],
        [0.3775, 0.5113, 0.2998, 0.9780],
        [0.2836, 0.4893, 0.7788, 0.1989],
        [0.2366, 0.2121, 0.2523, 0.5487]], dtype=torch.float64)

Both NumPy and torch support multiple different data types. By default, NumPy uses 64-bit floats, and PyTorch defaults to 32-bit floats. However, if you create a PyTorch tensor from a NumPy tensor, it uses the same type as the given NumPy tensor. You can see that in the previous output, where PyTorch let us know that dtype=torch.float64 since it is not the default choice.

The most common types we care about for deep learning are 32-bit floats, 64-bit integers (Longs), and booleans (i.e., binary True/False). Most operations leave the tensor type unchanged unless we explicitly create or cast it to a new type. To avoid issues with types, you can specify explicitly what type of tensor you want to create when calling a function. The following code checks what type of data is contained in our tensor using the dtype attribute:

print(x_np.dtype, x_pt.dtype) 
float64 torch.float64

x_np = np.asarray(x_np, dtype=np.float32)     #1
x_pt = torch.tensor(x_np, dtype=torch.float32) 
print(x_np.dtype, x_pt.dtype) 

float32 torch.float32

The main exception to using 32-bit floats or 64-bit integers as the dtype is when we need to perform logic operations (like Boolean AND, OR, NOT), which we can use to quickly create binary masks.

A mask is a tensor that tells us which portions of another tensor are valid to use. We use masks in some of our more complex neural networks. For example, let’s say we want to find every value greater than 0.5 in a tensor. Both PyTorch and NumPy let us use the standard logic operators to check for things like this:

b_np = (x_np > 0.5) 
print(b_np) 
print(b_np.dtype) 

[[False True True False]
[False True False True]
[False False True False]
[False False False True]]
bool

b_pt = (x_pt > 0.5) 
print(b_pt) 
print(b_pt.dtype) 
tensor([[False,  True,  True, False],
       [False,  True, False,  True], 
       [False, False,  True, False], 
       [False, False, False,  True]]) 
torch.bool
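
As a small added example (not from the original listing), a boolean mask like b_pt can be used to pull out just the flagged values or to zero out everything else:

selected = x_pt[b_pt]                                       # A 1D tensor holding every value > 0.5
zeroed = torch.where(b_pt, x_pt, torch.zeros_like(x_pt))    # Values <= 0.5 become 0.0
print(selected.shape)

torch.Size([6])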

While the NumPy and PyTorch APIs are not identical, they share many functions with the same names, behaviors, and characteristics:

    np.sum(x_np) 

[13]: 7.117571

     torch.sum(x_pt) 

[14]: tensor(7.1176)

While many functions are the same, some are not quite identical. There may be slight differences in behavior or in the arguments required. These discrepancies are usually because the PyTorch version has made changes that are particular to how these methods are used for neural network design and execution. Following is an example of the transpose function: PyTorch requires us to specify which two dimensions to swap, whereas NumPy simply reverses all the dimensions by default, without complaint:

     np.transpose(x_np) 
[15]: array([[0.05095622, 0.37754387, 0.28363854, 0.23659933], 
             [0.6433009 , 0.51127553, 0.48929802, 0.21207824], 
             [0.982938  , 0.29976255, 0.77875257, 0.25225455], 
             [0.2735579 , 0.97804976, 0.19889717, 0.54866767]], dtype=float32)
     torch.transpose(x_pt, 0, 1) 

[16]: tensor([[0.0510, 0.3775, 0.2836, 0.2366], 
              [0.6433, 0.5113, 0.4893, 0.2121], 
              [0.9829, 0.2998, 0.7788, 0.2523], 
              [0.2736, 0.9780, 0.1989, 0.5487]])

PyTorch does this because we often want to transpose dimensions of a tensor for deep learning applications, whereas NumPy tries to stay with more general expectations. As shown next, we can transpose two of the dimensions in our torch_tensor3d from the start of the chapter. Originally it had a shape of (4,2,3). If we transpose the first and third dimensions, we get a shape of (3,2,4):

print(torch.transpose(torch_tensor3d, 0, 2).shape) 
torch.Size([3, 2, 4])

Because such differences exist, you should always double-check the PyTorch documentation at https://pytorch.org/docs/stable/index.html if you attempt to use a function you are familiar with but suddenly find it does not behave as expected. It is also a good tool to have open when using PyTorch. There are many different functions that can help you in PyTorch, and we cannot review them all.

1.2.1  PyTorch GPU acceleration

The first important functionality that PyTorch gives us beyond what NumPy can do is using a GPU to accelerate mathematical calculations. GPUs are hardware in your computer specifically designed for 2D and 3D graphics, mainly to accelerate videos (watching an HD movie) or play video games. What does that have to do with neural networks? Well, a lot of the math involved in making 2D and 3D graphics fast is tensor-based or at least tensor-related. For this reason, GPUs have gotten very good at doing the kinds of computations we want, and doing them quickly. As graphics, and thus GPUs, became better and more powerful, people realized they could also be used for scientific computing and ML.

At a high level, you can think of GPUs as giant tensor calculators. You should almost always use a GPU when doing anything with neural networks. It is a good pair-up since neural networks are compute-intensive, and GPUs are fast at the exact type of computations we need to perform. If you want to do deep learning in a professional context, you should invest in a computer with a powerful NVIDIA GPU. But for now, we can get by for free using Colab.

The trick to using GPUs effectively is to avoid computing on a small amount of data. This is because your computer’s CPU must first move data to the GPU, then ask the GPU to perform its math, wait for the GPU to finish, and then copy the results back from the GPU. The steps in this process are fairly slow; and if we are only calculating a few things, using a GPU takes longer than the CPU would take to do the math.

What counts as “too small?” Well, that depends on your CPU, the GPU, and the math you are doing. If you are worried about this problem, you can do some benchmarking to see if using the CPU is faster. If so, you are probably working on too little data.

Let’s test this with matrix multiplication—a basic linear algebra operation that is common in neural networks. If we have an n×m matrix X and an m×p matrix Y, we can compute the resulting n×p matrix C = XY. Note that C has as many rows as X and as many columns as Y. When implementing neural networks, we do lots of operations that change the shape of a tensor, like what happens when we multiply two matrices together. This is a common source of bugs, so you should think about tensor shapes when writing code.
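
As a quick sanity check of that shape rule, here is a small sketch (added for illustration; the variable names are mine):

A = torch.rand(3, 5)    # n=3 rows, m=5 columns
B = torch.rand(5, 2)    # m=5 rows, p=2 columns
C = A @ B               # The result has n=3 rows and p=2 columns
print(C.shape)

torch.Size([3, 2])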

We can use the timeit library: it lets us run code multiple times and tells us how long it took to run. We make a larger matrix X, compute XX several times, and see how long that takes to run:

import timeit 
x = torch.rand(2**11, 2**11) 
time_cpu = timeit.timeit("x@x", globals=globals(), number=100)

It takes a bit of time to run that code, but not too long. On my computer, it took 6.172 seconds to run, which is stored in the time_cpu variable. Now, how do we get PyTorch to use our GPU? First we need to create a device reference. We can ask PyTorch to give us one using the torch.device function. If you have an NVIDIA GPU, and the CUDA drivers are installed properly, you should be able to pass in cuda as a string and get back an object representing that device:

print("Is CUDA available? :", torch.cuda.is_available()) 
device = torch.device("cuda") 
Is CUDA available? : True

Now that we have a reference to the GPU (device) we want to use, we need to ask PyTorch to move that object to the given device. Luckily, that can be done with the to() method; then we can use the same code as before:

x = x.to(device) 
time_gpu = timeit.timeit("x@x", globals=globals(), number=100)

When I run this code, the time to perform 100 multiplications is 0.6191 seconds, which is an instant 9.97× speedup. This was a pretty ideal case, as matrix multiplications are super-efficient on GPUs, and we created a big matrix. You should try making the matrix smaller and see how that impacts the speedup you get.
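
If you want to try that experiment, here is a minimal sketch (my own, with made-up variable names) that repeats the timing on a much smaller matrix. At this size, the overhead of talking to the GPU usually dominates, so the CPU tends to win. Keep in mind that CUDA calls are asynchronous, so timings like these are only rough estimates unless you also call torch.cuda.synchronize():

small_cpu = torch.rand(2**4, 2**4)
time_small_cpu = timeit.timeit("small_cpu@small_cpu", globals=globals(), number=100)

small_gpu = small_cpu.to(device)
time_small_gpu = timeit.timeit("small_gpu@small_gpu", globals=globals(), number=100)
print(time_small_cpu, time_small_gpu)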

Be aware that this only works if every object involved is on the same device. Say you run the following code, where the variable x has been moved onto the GPU and y has not (so it is on the CPU by default):

x = torch.rand(128, 128).to(device) 
y = torch.rand(128, 128) 
x*y

You will end up getting an error message that says:

RuntimeError: expected device cuda:0 but got device cpu

The error tells you which device the first variable is on (cuda:0) but that the second variable was on a different device (cpu). If we instead wrote y*x you would see the error change to expected device cpu but got device cuda:0. Whenever you see an error like this, you have a bug that kept you from moving everything to the same compute device.
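
The fix is simply to make sure both tensors live on the same device before combining them. For example (a small sketch added here, continuing with x and y from above):

y = y.to(device)      # Move y to the same device as x
print((x*y).shape)    # Now the multiplication works, and the result lives on the GPU

torch.Size([128, 128])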

The other thing to be aware of is how to convert PyTorch data back to the CPU. For example, we may want to convert a tensor back to a NumPy array so that we can pass it to Matplotlib or save it to disk. The PyTorch tensor object has a .numpy() method that will do this, but if you call x.numpy(), you will get this error:

TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() 
to copy the tensor to host memory first.

Instead, you can use the handy shortcut function .cpu() to move an object back to the CPU, where you can interact with it normally. So you will often see code that looks like x.cpu().numpy() when you want to access the results of your work.
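
For example, here is a short sketch (added for illustration) of computing something on the GPU and then pulling the result back for NumPy or Matplotlib to use:

result = (x @ x).cpu().numpy()    # Compute on the GPU, then copy the result back as a NumPy array
print(type(result), result.shape)

<class 'numpy.ndarray'> (128, 128)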

The .to() and .cpu() methods make it easy to write code that is suddenly GPU accelerated. Once on a GPU or similar compute device, almost every method that comes with PyTorch can be used and will net you a nice speedup. But sometimes we want to store tensors and other PyTorch objects in a list, dictionary, or other standard Python collection. To help with that, we can define this moveTo function, which goes recursively through the common Python and PyTorch containers and moves every object found onto the specified device:

def moveTo(obj, device): 
    """ 
    obj: the python object to move to a device, or to move its
     contents to a device
    device: the compute device to move objects to 
    """
    if isinstance(obj, list): 
        return [moveTo(x, device) for x in obj] 
    elif isinstance(obj, tuple): 
        return tuple(moveTo(list(obj), device)) 
    elif isinstance(obj, set): 
        return set(moveTo(list(obj), device)) 
    elif isinstance(obj, dict): 
        to_ret = dict() 
        for key, value in obj.items(): 
            to_ret[moveTo(key, device)] = moveTo(value, device) 
        return to_ret 
    elif hasattr(obj, "to"): 
        return obj.to(device) 
    else: 
        return obj

some_tensors = [torch.tensor(1), torch.tensor(2)] 

print(some_tensors)
print(moveTo(some_tensors, device))
[tensor(1), tensor(2)]
[tensor(1, device='cuda:0'), tensor(2, device='cuda:0')]

The first time we printed the arrays, we saw tensor(1) and tensor(2); but after using the moveTo function, device=cuda:0 appeared. We won’t have to use this function often, but when we do, it will make our code easier to read and write. With that, we now have the fundamentals to write fast code accelerated by GPUs.


1.3 Automatic differentiation

So far, we’ve seen that PyTorch provides an API similar to NumPy for performing mathematical operations on tensors, with the advantage of using a GPU (when available) to perform faster math operations. The second major foundation that PyTorch gives us is automatic differentiation: as long as we use PyTorch-provided functions, PyTorch can compute derivatives (also called gradients) automatically for us. In this section, we learn what that means and how automatic differentiation ties into the task of minimizing a function. In the next section, we see how to wrap it all up in a simple API provided by PyTorch.

Your first thought may be, “What is a derivative, and why do I care about that?” Remember from calculus that the derivative of a function f(x) tells us how quickly the value of f(x) is changing. We care about this because we can use the derivative of a function f(x) to help us find an input x* that is a minimizer of f(x). The value x* being a minimizer means the value of f(x*) is smaller than f(x* + z) for whatever value we set z to. The mathy way to say this is f(x*) ≤ f(x* + z), ∀z.

Another way to say this is that if I wrote the following code, I would be stuck waiting for an infinite loop:

import random

while f(x_star) <= f(random.uniform(-1e100, 1e100)):
    pass

Why do we want to minimize a function? For all the kinds of ML and deep learning we discuss in this book, we train neural networks by defining a loss function. The loss function tells the network, in a numeric and quantifiable way, how badly it is doing at the problem. So if the loss is high, things are going poorly. A high loss means the network is losing the game, and badly. If the loss is zero, the network has solved the problem perfectly. We don’t usually allow the loss to go negative because that gets confusing to think about.

When you read math about neural networks, you will often see the loss function defined as ℓ(x), where x are the inputs to the network and ℓ(x) gives us the loss the network received. Because of this, loss functions return scalars. This is important because we can compare scalars and say that one is definitively bigger or smaller than another, so it becomes unambiguous how bad a network is at the game. The derivative is generally defined with respect to a single variable, but our networks will have many variables (parameters). When getting the derivative with respect to multiple variables, we call it a gradient; you can apply the same intuition about derivatives and one variable to gradients over many variables.

We have stated that gradients are helpful, and perhaps you remember from a calculus class about minimizing functions using derivatives and gradients. Let’s do a bit of a math reminder about how to find the minimum of a function using calculus.

Say we have the function f(x) = (x−2)2. Let’s define that with some PyTorch code and plot what the function looks like:

    def f(x):
        return torch.pow((x-2.0), 2)

    x_axis_vals = np.linspace(-7,9,100) 
    y_axis_vals = f(torch.tensor(x_axis_vals)).numpy()

    sns.lineplot(x=x_axis_vals, y=y_axis_vals, label='$f(x)=(x-2)^2$')

[22]: <AxesSubplot:>

1.3.1  Using derivatives to minimize losses

We can clearly see that the minimum of this function is at x = 2, where we get the value f(2) = 0. But this is an intentionally easy problem. Let’s say we can’t plot it; we can use calculus to help us find the answer.

We denote the derivative of f(x) as f′(x), and we can get the answer (using calculus) that f′(x) = 2 ⋅ x − 4. The minimum of a function (x*) exists at critical points, which are points where f′(x) = 0. So let’s find them by solving for x. In our case, we get

2 ⋅ x − 4 = 0

2 ⋅ x = 4   (add 4 to both sides)

x = 4/2 = 2   (divide each side by 2)

This required us to solve the equation for when f′(x) = 0. PyTorch can’t quite do that for us because we are going to be developing more complicated functions where finding the exact answer is not possible. But say we have a current guess, x?, that we are pretty sure is not the minimizer. We can use f′(x?) to help us determine how to adjust x? so that we move closer to a minimizer.

How is that possible? Let’s plot f(x) and f′(x) at the same time:

def fP(x):                          #1
    return 2*x-4

y_axis_vals_p = fP(torch.tensor(x_axis_vals)).numpy()

sns.lineplot(x=x_axis_vals, y=[0.0]*len(x_axis_vals),
             label="0", color='black')    #2
sns.lineplot(x=x_axis_vals, y=y_axis_vals,
             label='Function to Minimize $f(x) = (x-2)^2$')
sns.lineplot(x=x_axis_vals, y=y_axis_vals_p,
             label="Gradient of the function $f'(x)=2 x - 4$")

[23]: <AxesSubplot:>
#1 Defines the derivative of f(x) manually
#2 Draws a black line at 0 so we can easily tell if something is positive or negative

Look at the orange line. When we are too far to the left of the minimum (x = 2), we see that f′(x?) < 0. When we are to the right of the minimum, we instead get f′(x?) > 0. Only when we are at a minimum do we see that f′(x?) = 0. So if f′(x?) < 0, we need to increase x?; and if f′(x?) > 0, we need to decrease the value of x?. The sign of the gradient f′(x?) tells us which direction we should move to find a minimizer. This process of gradient descent is summarized in figure 1.6.

Figure 1.6 The process to minimize a function f(x) using its derivative f′(x) is called gradient descent, and this figure shows how it is done. We iteratively compute f′(x) to decide whether x should be larger or smaller to make the value of f(x) as small as possible. The process stops when we are close enough to the gradient being zero. You can also stop early if you have done a lot of updates: “close enough is good enough” holds true for deep learning, and we rarely need to perfectly minimize a function.

We also care about the magnitude of f′(x?). Because we are looking at a one-dimensional function, the magnitude just means the absolute value of f′(x?): i.e., |f′(x?)|. The magnitude gives us an idea how far we are from the minimizer. So the sign of f′(x?) (<0 or >0) tells us which direction we should move, and the size (|f′(x?)|) tells us how far we should move.

This is not a coincidence. It is always true for any function. If we can compute a derivative, we can find a minimizer. You may be thinking, “I don’t remember my calculus all that well,” or complaining that I skipped the steps about how to compute f′(x). This is why we use PyTorch: automatic differentiation computes the value of f′(x) for us. Let’s use the toy example of f(x) = (x−2)2 to see how it works.

1.3.2  Calculating a derivative with automatic differentiation

Now that we understand the concept of minimizing a function using its derivative, let’s walk through the mechanics of doing it in PyTorch. First, let’s create a new variable to minimize. We do this similar to before, but we add a new flag telling PyTorch to keep track of the gradient. This is stored in a variable called grad, which does not exist yet since we haven’t computed anything:

x = torch.tensor([-3.5], requires_grad=True) 
print(x.grad)

None

We see there is no current gradient. Let’s try computing f(x), though, and see if anything changes now that we set requires_grad=True:

value = f(x) 
print(value)

tensor([30.2500], grad_fn=<PowBackward0>)

Now when we print the value of the returned variable, we get slightly different output. In the first part, the value 30.25 is printed, which is the correct value of f(−3.5). But we also see this new grad_fn=<PowBackward0>. Once we tell PyTorch to start calculating gradients, it begins to keep track of every computation we do. It uses this information to go backward and calculate the gradients for everything that was used and had a requires_grad flag set to True.

Once we have a single scalar value, we can tell PyTorch to go back and use this information to compute the gradients. This is done using the .backward() function, after which we see a gradient in our original object:

value.backward() 
print(x.grad)

tensor([-11.])

With that, we have now computed a gradient for the variable x. Part of the power of PyTorch and automatic differentiation is that you can make the function f(x) do almost anything, as long as it is implemented using PyTorch functions. The code we wrote for computing the gradient of x will not change. PyTorch handles all the details of how to compute it for us.
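
To see what that means, here is a sketch with a different, made-up function g (my own example, not one of the book’s listings); notice that the gradient-computing code does not change at all:

def g(w):
    return torch.sum(torch.sin(w) * w**2)    # Some more complicated function of several values

w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
value = g(w)
value.backward()
print(w.grad)    # One gradient entry per value in w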

1.3.3  Putting it together: Minimizing a function with derivatives

Now that PyTorch can compute gradients for us, we can use the automatic differentiation of our PyTorch function f(x) to numerically find the answer f(2) = 0. We are going to describe it first using mathematical notation and then in code.

We start with our current guess, x_cur = −3.5. I chose −3.5 arbitrarily; in real life, you would usually pick a random value. We also keep track of our previous guess using x_prev. Since we have not done anything yet, it is fine to set the previous step to any large value (e.g., x_prev = x_cur ⋅ 100).

Next, we compare whether our current and previous guesses are similar. We do this by checking whether ∥x_cur − x_prev∥₂ > ϵ. The function ∥z∥₂ is called the norm or 2-norm. Norms are the most common and standard ways of measuring magnitude for vectors and matrices. For one-dimensional cases (like this one), the 2-norm is the same as the absolute value. If we do not explicitly state what kind of norm we are talking about, you should always assume the 2-norm. The value ϵ is a common mathematical notation to refer to an arbitrarily small value. So the way to read this is “the size of the change between our current and previous guesses is larger than some small value ϵ.”

Now we know that ∥x_cur − x_prev∥₂ > ϵ is how we check whether there are large (> ϵ) magnitude (∥ ⋅ ∥₂) changes (x_cur − x_prev) between our guesses. If this is false, ∥x_cur − x_prev∥₂ ≤ ϵ, which means the change was small and we can stop. Once we stop, we accept x_cur as our answer for the value of x that minimized f(x). If not, we need a new, better guess.

To get this new guess, we move in the opposite direction of the derivative. The update looks like this: x_cur = x_cur − η ⋅ f′(x_cur). The value η is called the learning rate and is usually a small value like η = 0.1 or η = 0.01. We do this because the gradient f′(x) tells us which way to move but only gives us a relative answer about how far away we are. It doesn’t tell us exactly how far we should travel in that direction. Since we don’t know how far to travel, we want to be conservative and go a little slower. Figure 1.7 shows why.

Figure 1.7 Three examples of how the learning rate η (also called step size) impacts learning. On the left, η is smaller than necessary. This still reaches the minimum but takes more steps than needed. If we knew the perfect value of η, we could set it just right, to take the smallest number of steps to the minimum value (middle). On the right, η is too big, which causes divergence. We never reach the solution!

By taking smaller steps in the current direction, we don’t “drive past” the answer and have to turn around. Look at the previous example of our function to understand how that happens. If we have the exactly correct best value of η (middle image), we can take one step to the minimum. But we do not know what that value is. If we are instead conservative and choose a value that is likely smaller than we need, we may take more steps to get to the answer, but we eventually get there (left image). If we set our learning rate too high, we can end up shooting past the solution and bouncing around it (right image).

That might sound like a lot of scary math, but hopefully you will feel better about it when you look at the code that does the work. It is only a few lines long. At the end of the loop, we print the value of xcur and see that it is equal to 2.0; PyTorch found the answer. Notice that when we define a PyTorch Tensor object, it has a child member .grad that stores the computed gradients for that variable, as well as a .data member that holds the underlying value. You usually shouldn’t access either of these fields unless you have a specific reason to; for now, we are using them to demonstrate the mechanics of autograd:

x = torch.tensor([-3.5], requires_grad=True)

x_cur = x.clone()
x_prev = x_cur*100                                  # Any value far from x_cur, so the loop starts
epsilon = 1e-5                                      # Stop once the change between guesses is this small
eta = 0.1                                           # The learning rate

while torch.linalg.norm(x_cur-x_prev) > epsilon:
    x_prev = x_cur.clone()                          # Remember the previous guess
    value = f(x)                                    # Compute f(x), tracking the gradient
    value.backward()
    x.data -= eta * x.grad                          # Move opposite the gradient, scaled by eta
    x.grad.zero_()                                  # Reset the gradient for the next iteration
    x_cur = x.data                                  # Our new current guess

print(x_cur)

tensor([2.0000])

1.4 Optimizing parameters

What we just did, finding the minimum of a function f(⋅), is called optimization. Because we specify the goal of our network using a loss function ℓ(⋅), we can use the same optimization process to minimize our loss. If we reach a loss of ℓ(⋅) = 0, our network appears to have solved the problem. This is why we care about optimization: it is foundational to how most modern neural networks are trained today. Figure 1.8 shows a simplification of how it works.

Figure 1.8 How neural networks use the loss ℓ(⋅) and optimization process. The neural network is controlled by its parameters θ. To make useful predictions about the data, we need to alter the parameters. We do so by first computing the loss ℓ(⋅), which tells us how badly the network is doing. Since we want to minimize the loss, we can use the gradient to alter the parameters! This gets the network to make useful predictions.

Because of how important optimization is, PyTorch includes two additional concepts to help us: parameters and optimizers. A Parameter of a model is a value that we alter using an Optimizer to try to reduce our loss ℓ(⋅). We can easily convert any tensor into a Parameter using the nn.Parameter class. To do that, let’s re-solve the previous problem of minimizing f(x) = (x−2)² with an initial guess of x_cur = −3.5. The first thing we do is create a Parameter object for the value of x, since that is what we are going to alter:

x_param = torch.nn.Parameter(torch.tensor([-3.5]), requires_grad=True)

The object x_param is now a nn.Parameter, which behaves the same way tensors do. We can use a Parameter anywhere we would use a tensor in PyTorch, and the code will work fine. But now we can create an Optimizer object. The simplest optimizer we use is called SGD, which stands for stochastic gradient descent. The word gradient is there because we are using the gradients/derivatives of functions. Descent means we are minimizing or descending to a lower value of the function that we are minimizing. We get to the stochastic part in the next chapter.

To use SGD, we need to create the associated object with a list of Parameters that we want to adjust. We can also specify the learning rate η or accept the default. The following code specifies η to match the original code:

optimizer = torch.optim.SGD([x_param], lr=eta)

Now we can rewrite the previous ugly loop into something cleaner that looks much closer to how we train neural networks in practice. We will loop over the optimization problem a fixed number of times, which we often call epochs. The zero_grad method does the cleanup we did manually before for every parameter passed in as an input. We compute our loss, call .backward() on that loss, and then ask the optimizer to perform one .step() of the optimization:

for epoch in range(60):
    optimizer.zero_grad()             # Resets the gradients of every parameter we are optimizing
    loss_incurred = f(x_param)
    loss_incurred.backward()
    optimizer.step()                  # Takes one step of gradient descent on the parameters
print(x_param.data)

The code prints out tensor(2.0000), just like before. This will make our lives easier when we have literally millions of parameters in our network.

You’ll notice a significant change in this code: we are not optimizing until we hit a gradient of zero or the difference between the previous and current solutions is very small. Instead, we are doing something dumber: a fixed number of steps. In deep learning, we rarely get to a loss of zero, and we would have to wait way too long for that to happen. So most people pick a fixed number of epochs that they are willing to wait for and then see what the results look like at the end. This way, we get an answer faster, and it’s usually good enough to use.
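
One small addition you will often see in practice (a sketch of my own, not from the original listing) is recording the loss at every epoch so you can check afterward that it really was going down:

x_param = torch.nn.Parameter(torch.tensor([-3.5]), requires_grad=True)
optimizer = torch.optim.SGD([x_param], lr=eta)

loss_history = []
for epoch in range(60):
    optimizer.zero_grad()
    loss_incurred = f(x_param)
    loss_incurred.backward()
    optimizer.step()
    loss_history.append(loss_incurred.item())    # .item() pulls the loss out as a plain Python float

sns.lineplot(x=list(range(60)), y=loss_history, label="Loss per epoch")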


1.5 Loading dataset objects

We have learned a little about the basic PyTorch tools. Now we want to start training a neural network. But first we need some data. Using the common notation of ML, we need a set of input data X and associated output labels y. In PyTorch, we represent that with a Dataset object. By using this interface, PyTorch provides efficient loaders that automatically handle using multiple CPU cores to pre-fetch the data and keep a limited amount of data in memory at a time. Let’s start by loading a familiar dataset from scikit-learn: MNIST. We convert it from a NumPy array to the form PyTorch likes.

PyTorch uses a Dataset class to represent a dataset, and it encodes the information about how many items are in the dataset and how to get the nth item in the dataset. Let’s see what that looks like:

from torch.utils.data import Dataset
from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', version=1, return_X_y=True)    # Downloads the MNIST dataset from OpenML
print(X.shape)

(70000, 784)

We have loaded the classic MNIST dataset with a total of 70,000 rows and 784 features. Now we will create a simple Dataset class that takes in X, y as input. We need to define a __getitem__ method, which will return the data and label as a tuple (inputs, targets). The inputs are what we feed into the model, and the targets are the outputs (labels) we want the model to predict. We also need to implement the __len__ function that returns how large the dataset is:

class SimpleDataset(Dataset):

    def __init__(self, X, y):
        super(SimpleDataset, self).__init__()
        self.X = X
        self.y = y

    def __getitem__(self, index):
        inputs = torch.tensor(self.X[index,:], dtype=torch.float32)   # Converts one row to a float tensor
        targets = torch.tensor(int(self.y[index]), dtype=torch.int64)
        return inputs, targets

    def __len__(self):
        return self.X.shape[0]

dataset = SimpleDataset(X, y)    # Wraps our NumPy data in the Dataset interface

Notice that we do the minimal amount of work in the constructor and instead move it into the __getitem__ function. This is intentional design and a habit you should emulate when doing deep learning work. In many cases, we need to do non-trivial preprocessing, preparation, and conversions to get our data into a form that a neural network can learn from. If you put those tasks into the __getitem__ function, you get the benefit of PyTorch doing the work on an as-needed basis while you wait for your GPU to finish processing some other batch of data, making your overall process more compute efficient. This becomes really important when you work on larger datasets where preprocessing would cause a long delay up front or require extra memory, and doing the prep only when needed can save you a lot of storage.
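
To make that pattern concrete, here is a hedged sketch (my own variation, not from the book) of the same class with a bit of on-the-fly preprocessing: scaling MNIST’s 0–255 pixel values down to [0, 1] inside __getitem__ rather than in the constructor:

class NormalizedDataset(Dataset):

    def __init__(self, X, y):
        super(NormalizedDataset, self).__init__()
        self.X = X
        self.y = y

    def __getitem__(self, index):
        inputs = torch.tensor(self.X[index,:], dtype=torch.float32)
        inputs = inputs / 255.0                                        # Preprocessing happens only when a row is requested
        targets = torch.tensor(int(self.y[index]), dtype=torch.int64)
        return inputs, targets

    def __len__(self):
        return self.X.shape[0]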

Note

You may wonder why we use int64 as the tensor type for the targets. Why not int32 or even int8, if we know our labels are in a smaller range, or uint32 if no negative values will occur? The unsatisfying answer is that for any situation where int types are required, PyTorch is hardcoded to work only with int64, so you just have to use it. Similarly, when floating-point values are needed, most of PyTorch will only work with float32, so you have to use float32 instead of float64 or other types. There are exceptions, but they are not worth getting into while learning the fundamentals.

Now we have a simple dataset object. It keeps the entire dataset in memory, which is OK for small datasets, but we want to fix it in the future. We can confirm that the dataset still has 70,000 examples, and each example has 784 features, as before, and quickly confirm that the length and index functions we implemented work as expected:

print("Length: ", len(dataset)) 
example, label = dataset[0]
print("Features: ", example.shape) #1
print("Label of index 0: ", label) 

Length: 70000
Features: torch.Size([784])
Label of index 0: tensor(5)

MNIST is a dataset of hand-drawn numbers. We can visualize this by reshaping the data back into an image to confirm that our data loader is working:

    plt.imshow(example.reshape((28,28)))

[34]: <matplotlib.image.AxesImage at 0x7f4721b9fc50>

1.5.1  Creating a training and testing split

Now we have all of our data in one dataset. However, like good ML practitioners, we should create a training split and a testing split. In some cases, we have a dedicated training and testing dataset. If that is the case, you should create two separate Dataset objects—one for training and one for testing—from the respective data sources.

In this case, we have one dataset. PyTorch has a simple utility to break the corpus into train and test splits. Let’s say we want 20% of the data to be used for testing. We can do that as follows using the random_split method:

train_size = int(len(dataset)*0.8)
test_size = len(dataset)-train_size

train_dataset, test_dataset = torch.utils.data.random_split(dataset,
    (train_size, test_size))
print("{} examples for training and {} for testing".format(
    len(train_dataset), len(test_dataset)))

56000 examples for training and 14000 for testing

Now we have train and test sets. In reality, the first 60,000 points are the standard training set for MNIST, and the last 10,000 are the standard test set. But the point was to show you the function for creating randomized splits yourself.
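
If you ever do want that standard split, torch.utils.data.Subset selects rows by index without copying the data. Here is a small sketch (assuming the rows of dataset are still in their original order):

from torch.utils.data import Subset

standard_train = Subset(dataset, range(60000))          # The first 60,000 rows
standard_test = Subset(dataset, range(60000, 70000))    # The last 10,000 rows
print(len(standard_train), len(standard_test))

60000 10000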

With that, we have learned about all of the foundational tools that PyTorch provides:

  • A NumPy-like tensor API, which supports GPU acceleration
  • Automatic differentiation, which lets us solve optimization problems
  • An abstraction for datasets

We will build on this foundation, and you may notice that it starts to impact how you think about neural networks in the future. They do not magically do what is asked, but they try to numerically solve a goal specified by a loss function ℓ(⋅). We need to be careful in how we define or choose ℓ(⋅) because that will determine what the algorithm learns.

Exercises

Share and discuss your solutions on the Manning online platform at Inside Deep Learning Exercises (https://liveproject.manning.com/project/945). Once you submit your own answers, you will be able to see the solutions submitted by other readers, and see which ones the author judges to be the best.

  1. Write a series of for loops that compute the average value in torch_tensor3d.
  2. Write code that indexes into torch_tensor3d and prints out the value 13.
  3. For every power of 2 (i.e., 2ⁱ or 2**i) up to 2¹¹, create a random 2ⁱ×2ⁱ matrix X (i.e., X.shape should give (2**i, 2**i)). Time how long it takes to compute XX (i.e., X @ X) on a CPU and on a GPU, and plot the speedup. For what matrix sizes is the CPU faster than the GPU?
  4. We used PyTorch to find the numeric solution to f(x) = (x−2)². Write code that finds the solution to f(x) = sin(x − 2) · (x + 2)² + √|cos(x)|. What answer do you get?
  5. Write a new function that takes two inputs, x and y, where f(x, y) = exp(sin(x)²)/(xy)² + (xy)². Use an Optimizer with initial parameter values of x = 0.2 and y = 10. What do they converge to?
  6. Create a function libsvm2Dataset that takes a path to a libsvm dataset file (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ for many that you can download) and creates a new dataset object. Check that it is the correct length and that each row has the expected number of features.
  7. Challenging: Use NumPy’s memmap functionality to write the MNIST dataset to disk. Then create a MemmapedSimpleDataset that takes the mem-mapped file as input, reading the matrix from disk to create PyTorch tensors in the __getitem__ method. Why do you think this would be useful?

Summary

  • PyTorch represents almost everything using tensors, which are multidimensional arrays.
  • Using a GPU, PyTorch can accelerate any operation done using tensors.
  • PyTorch tracks what we do with tensors to perform automatic differentiation, which means it can calculate gradients.
  • We can use gradients to minimize a function; the values altered by a gradient are parameters.
  • We use a loss function to quantify how well a network is doing at a task and gradients to minimize that loss, which results in learning the parameters of the network.
  • PyTorch provides a Dataset abstraction so that we can let PyTorch deal with some nuisance tasks and minimize memory usage.