Appendix B. The backpropagation algorithm
Chapter 5 introduced sequential neural networks and feed-forward networks in particular. We briefly talked about the backpropagation algorithm, which is used to train neural networks. This appendix explains in a bit more detail how to arrive at the gradients and parameter updates that we simply stated and used in chapter 5.
We’ll first derive the backpropagation algorithm for feed-forward neural networks and then discuss how to extend the algorithm to more-general sequential and nonsequential networks. Before going deeper into the math, let’s define our setup and introduce notation that will help along the way.
In this section, you’ll work with a feed-forward neural network with l layers, each of which uses a sigmoid activation function. The weights of the ith layer are referred to as Wi, and its bias terms as bi. You use x to denote a mini-batch of size k of input data to the network, and y to denote the network’s output. It’s safe to think of both x and y as vectors here, but all operations carry over to mini-batches. The sketch that follows illustrates this setup in code.
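To make the setup concrete, here is a minimal sketch of the forward pass in NumPy. It assumes each layer computes sigmoid(Wi · a + bi) and that the samples of a mini-batch are stored as columns of x; the function names sigmoid and forward and the concrete layer sizes are illustrative choices, not code from chapter 5.

```python
import numpy as np


def sigmoid(z):
    # Sigmoid activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases):
    # Forward pass through l sigmoid layers.
    # weights[i] and biases[i] stand in for Wi and bi; x holds one
    # sample per column, so a mini-batch of size k has k columns.
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(np.dot(W, activation) + b)
    return activation  # the network output y


# Illustrative example: layer sizes 4 -> 5 -> 3, mini-batch of size k = 2.
sizes = [4, 5, 3]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]
x = np.random.randn(sizes[0], 2)
y = forward(x, weights, biases)  # y has shape (3, 2)
```

Calling forward on a mini-batch x yields the output y, which is exactly the quantity the backpropagation derivation starts from. With this setup in place, we introduce the following notation: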