Appendix B. The backpropagation algorithm
Chapter 5 introduced sequential neural networks and feed-forward networks in particular. We briefly talked about the backpropagation algorithm, which is used to train neural networks. This appendix explains in a bit more detail how to arrive at the gradients and parameter updates that we simply stated and used in chapter 5.
We’ll first derive the backpropagation algorithm for feed-forward neural networks and then discuss how to extend the algorithm to more-general sequential and nonsequential networks. Before going deeper into the math, let’s define our setup and introduce notation that will help along the way.
In this section, you’ll work with a feed-forward neural network with l layers, each of which uses a sigmoid activation function. The weights of the ith layer are referred to as Wi, and its bias terms as bi. You use x to denote a mini-batch of size k of input data to the network, and y to denote the network’s output. It’s safe to think of both x and y as vectors here, but all operations carry over to mini-batches. The sketch that follows illustrates this setup in code.
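To make the setup concrete, here is a minimal sketch of the forward pass in NumPy. It assumes each layer computes sigmoid(Wi · a + bi) and that the samples of a mini-batch are stored as columns of x; the function names sigmoid and forward and the concrete layer sizes are illustrative choices, not code from chapter 5.

```python
import numpy as np


def sigmoid(z):
    # Sigmoid activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases):
    # Forward pass through l sigmoid layers.
    # weights[i] and biases[i] stand in for Wi and bi; x holds one
    # sample per column, so a mini-batch of size k has k columns.
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(np.dot(W, activation) + b)
    return activation  # the network output y


# Illustrative example: layer sizes 4 -> 5 -> 3, mini-batch of size k = 2.
sizes = [4, 5, 3]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]
x = np.random.randn(sizes[0], 2)
y = forward(x, weights, biases)  # y has shape (3, 2)
```

Calling forward on a mini-batch x yields the output y, which is exactly the quantity the backpropagation derivation starts from. With this setup in place, we introduce the following notation: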