
In the last chapter, you saw how to determine parameter values by optimizing a loss function using stochastic gradient descent (SGD). This approach also scales to DL models with millions of parameters. But how did we arrive at the loss function itself? In the linear regression problem (see sections 1.4 and 3.1), we used the mean squared error (MSE) as the loss function. We don’t claim that minimizing the squared distances of the data points from the curve is a bad idea. But why use squared differences and not, for example, absolute differences?
It turns out that there is a generally applicable approach for deriving the loss function when working with probabilistic models: the maximum likelihood approach (MaxLike). You’ll see that, under certain assumptions that we discuss in detail in this chapter, the MaxLike approach yields the MSE as the loss function for linear regression.
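To preview the connection numerically, here is a minimal sketch (not from the book) using synthetic data: if we assume the observations are scattered around the regression line with Gaussian noise of fixed width, the negative log-likelihood differs from the MSE only by a constant factor and offset, so both criteria pick the same parameter value. The data, the fixed intercept, and the noise level `sigma` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)  # synthetic data (assumption)

# One-parameter model y ≈ a*x + 1 (intercept fixed at 1 for simplicity)
a_grid = np.linspace(0.0, 4.0, 401)

def mse(a):
    return np.mean((y - (a * x + 1.0)) ** 2)

def gaussian_nll(a, sigma=0.3):
    # Negative log-likelihood assuming y_i ~ N(a*x_i + 1, sigma^2)
    resid = y - (a * x + 1.0)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

best_mse = a_grid[np.argmin([mse(a) for a in a_grid])]
best_nll = a_grid[np.argmin([gaussian_nll(a) for a in a_grid])]
print(best_mse, best_nll)  # both criteria select the same slope on the grid
```

Because the NLL is an increasing affine function of the MSE for fixed `sigma`, the two grid searches land on exactly the same slope; the chapter derives this relationship formally.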