Chapter 8. Learning signal and ignoring noise: introduction to regularization and batching
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
John von Neumann, mathematician, physicist, computer scientist, and polymath
In the last several chapters, you’ve learned that neural networks model correlation. The hidden layers (the middle layer in the three-layer network) can even create intermediate correlation to help solve a task (seemingly out of thin air). How do you know the network is creating good correlation?
When we discussed stochastic gradient descent with multiple inputs, we ran an experiment where we froze one weight and then asked the network to continue training. As it trained, the dots found the bottoms of their bowls, as it were. You saw the weights adjust to minimize the error.
When we froze the weight, it nonetheless ended up at the bottom of its bowl. For some reason, the bowl moved so that the frozen weight’s value became optimal. Furthermore, if we then unfroze the weight and trained some more, it wouldn’t learn. Why? Well, the error had already fallen to 0. As far as the network was concerned, there was nothing more to learn.
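If you want to replay that experiment, here’s a minimal sketch (with made-up input values and a made-up learning rate, not the book’s exact dataset) of gradient descent over three inputs in which the first weight is frozen by zeroing out its update. The other weights absorb all the learning, the error still drops to (nearly) 0, and the frozen weight ends up sitting at the bottom of a bowl that moved to meet it.

import numpy as np

# Hypothetical single training example with three inputs and one target.
inputs = np.array([8.5, 0.65, 1.2])
goal = 1.0
weights = np.array([0.1, 0.2, -0.1])  # weights[0] will stay frozen
alpha = 0.1                           # assumed learning rate

for iteration in range(100):
    pred = inputs.dot(weights)        # forward pass: weighted sum
    delta = pred - goal               # raw error signal
    error = delta ** 2                # squared error

    weight_deltas = delta * inputs    # gradient for each weight
    weight_deltas[0] = 0              # freeze the first weight: no update

    weights -= alpha * weight_deltas  # adjust only the unfrozen weights

# The remaining weights compensated, so error is ~0 even though
# weights[0] never changed.
print("final error:", error)
print("final weights:", weights)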