9 Loss, optimization, and regularization

 

This chapter covers

  • Geometrical and algebraic introductions to loss functions
  • Geometrical intuitions for softmax
  • Optimization techniques, including SGD, momentum, Nesterov, AdaGrad, RMSProp, and Adam
  • Regularization and its relationship to Bayesian approaches
  • Overfitting while training, and dropout

By now, it should be etched in your mind that neural networks are essentially function approximators. In particular, neural network classifiers model the decision boundaries between classes in the feature space (the space in which every combination of input features is a specific point). In supervised classification, each training input in this space carries a class label (the ground truth), often generated manually. Training iteratively learns a function whose decision boundaries separate the training data points into their respective classes. If the training data set is reasonably representative of the true distribution of possible inputs, the network (the learned function that models the class boundaries) will classify never-before-seen inputs with good accuracy.
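To make this concrete before diving into the details, here is a minimal sketch (an illustrative assumption, not a listing from this chapter) of a tiny PyTorch classifier learning exactly such a decision boundary between two synthetic point clouds in a two-dimensional feature space. The particular network size, learning rate, and data are arbitrary choices for illustration; the loss function (section 9.1) and the SGD optimizer (section 9.2) appear only as library calls here, and the rest of the chapter unpacks what they do.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic training data: two Gaussian blobs, one per class (the ground-truth labels)
n = 200
class0 = torch.randn(n, 2) + torch.tensor([-2.0, 0.0])
class1 = torch.randn(n, 2) + torch.tensor([2.0, 0.0])
x = torch.cat([class0, class1])                          # points in feature space
y = torch.cat([torch.zeros(n), torch.ones(n)]).long()    # class labels (ground truth)

# A small feed-forward network: the function whose decision boundary we learn
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))

loss_fn = nn.CrossEntropyLoss()                           # loss: section 9.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # optimizer: section 9.2

# Iterative training: each step nudges the decision boundary toward
# separating the labeled points
for epoch in range(100):
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

# A never-before-seen input near the class-1 blob should be assigned class 1
print(model(torch.tensor([[2.5, 0.3]])).argmax(dim=1))

The final line classifies a point the network never saw during training; how the loss shapes that boundary, how the optimizer finds it, and how regularization keeps it from bending too tightly around the training points are the subjects of the sections that follow.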

9.1 Loss functions

9.1.1 Quantification and geometrical view of loss

9.1.2 Regression loss

9.1.3 Cross-entropy loss

9.1.4 Binary cross-entropy loss for image and vector mismatches

9.1.5 Softmax

9.1.6 Softmax cross-entropy loss

9.1.7 Focal loss

9.1.8 Hinge loss

9.2 Optimization

9.2.1 Geometrical view of optimization

9.2.2 Stochastic gradient descent and minibatches

9.2.3 PyTorch code for SGD

9.2.4 Momentum

9.2.5 Geometric view: Constant loss contours, gradient descent, and momentum

9.2.6 Nesterov accelerated gradients

9.2.7 AdaGrad

9.2.8 Root-mean-square propagation (RMSProp)

9.2.9 Adam optimizer

9.3 Regularization

9.3.1 Minimum description length: An Occam’s razor view of optimization

9.3.2 L2 regularization

9.3.3 L1 regularization