4 Building loss functions with the likelihood approach


This chapter covers:

  • Using the maximum likelihood principle for estimating model parameters
  • Determining a loss function for classification problems
  • Determining a loss function for regression problems

Deep learning models often have millions of parameters that you need to determine during the training process. In chapter 3 you saw how to determine the parameter values by optimizing a loss function with stochastic gradient descent (SGD). But how did we arrive at the loss function in the first place? In the linear regression problem, we used the mean squared error as the loss function. We don't claim that it is a dumb idea to minimize the squared distances of the data points from the curve. But why use the squared differences and not, for example, the absolute differences?
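To see that the choice actually matters, it helps to compute both candidates on the same data. The following is a minimal NumPy sketch with made-up target values and predictions, purely for illustration:

import numpy as np

# Made-up targets and model predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.5, 3.0])

# Mean squared error: large deviations are penalized quadratically
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute error: all deviations are penalized linearly
mae = np.mean(np.abs(y_true - y_pred))

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")

Note how the single large deviation (the last data point) dominates the MSE far more than the MAE. The two loss functions lead to different fitted curves, so we need a principled reason to prefer one over the other; that reason is what this chapter provides.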

Concerning classification, we considered in chapter 2 a classification problem where the task was to decide whether a banknote was fake or not. In another example, you classified images of handwritten digits (0, 1, ..., 9). In those cases we used a loss function called categorical cross entropy. What is this loss, and how do we arrive at it in the first place?
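To make the term concrete before we derive it, here is a minimal NumPy sketch for a single example in a 10-class problem such as the handwritten digits; the label and the predicted probabilities are made up for illustration:

import numpy as np

# Made-up example: one-hot label and predicted class probabilities
y_true = np.zeros(10)
y_true[3] = 1.0            # the true digit is a 3
y_pred = np.full(10, 0.05)
y_pred[3] = 0.55           # the model assigns 55% probability to digit 3

# Categorical cross entropy: negative log probability of the true class
cross_entropy = -np.sum(y_true * np.log(y_pred))
print(f"categorical cross entropy = {cross_entropy:.3f}")  # -log(0.55) = 0.598

Because the label is one-hot, the sum picks out a single term: the cross entropy is simply the negative log of the probability that the model assigns to the correct class. Why this is the right quantity to minimize is exactly what we derive in section 4.2.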

4.1 Introduction to the maximum likelihood principle, the mother of all loss functions

4.2 Deriving a loss function for a classification problem

4.2.1 Binary classification problem

4.2.2 Classification problems with more than two classes

4.2.3 Relationship between NLL, cross entropy, and Kullback-Leibler divergence

4.3 Deriving a loss function for regression problems

4.3.1 Using a NN without a hidden layer and one output neuron to model a linear relationship between input and output

4.3.2 Using a NN with hidden layers to model non-linear relationships between input and output

4.3.3 Using a NN with an additional output for regression tasks with non-constant variance

4.4 Summary