
9 Loss, Optimization and Regularization


By now, it should be etched in the reader's mind that neural networks are essentially function approximators. In particular, neural network classifiers model the decision boundaries between classes in the feature space (the space in which every combination of input features is a specific point). Supervised training starts from sample data points in this space whose class labels are known. The training process iteratively learns a function whose decision boundaries separate only these sampled training points. If the training data set is reasonably representative of the true classes, the network (i.e., the learnt function that models the class boundaries) will classify previously unseen inputs with good accuracy.

When we select a specific neural network architecture (a fixed set of layers, each with a fixed set of perceptrons and specific connections between them), we have essentially frozen the family of functions we will use as the function approximator. We still have to "learn" the exact weights of the connections between the perceptrons (sometimes called neurons). The training process iteratively sets these weights so as to best classify the training data points. This is done by designing a loss function that measures the departure of the network output from the desired result; training continually adjusts the weights to minimize this loss. There is a variety of loss functions to choose from.
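To make this concrete, the short PyTorch sketch below illustrates the whole recipe in miniature: the architecture is fixed up front, a loss function measures the departure of the network output from the desired labels, and an optimizer iteratively adjusts the weights to reduce that loss. The synthetic data, layer sizes, learning rate, and epoch count are arbitrary choices for illustration only; the particular loss (cross entropy) and optimizer (stochastic gradient descent) are treated in detail later in this chapter.

import torch
import torch.nn as nn

# Synthetic training set: 100 points in a 2-D feature space,
# each with a known class label from one of 3 classes (illustrative only).
features = torch.randn(100, 2)
labels = torch.randint(0, 3, (100,))

# Fixing the architecture freezes the family of functions the network
# can represent; training only adjusts the connection weights.
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)

# The loss function measures how far the network output departs
# from the desired result (here, the known class labels).
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Iteratively set the weights so as to minimize the loss on the training data.
for epoch in range(50):
    optimizer.zero_grad()             # clear gradients from the previous step
    outputs = model(features)         # forward pass with the current weights
    loss = loss_fn(outputs, labels)   # departure of output from desired labels
    loss.backward()                   # gradients of the loss w.r.t. the weights
    optimizer.step()                  # update weights to reduce the loss

Every variation discussed in this chapter, whether a different loss function, a different optimizer, or an added regularization term, slots into this same loop.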

9.1 Loss Functions

9.1.1 Quantification and Geometrical view of Loss

9.1.2 Regression Loss

9.1.3 Cross Entropy Loss

9.1.4 Cross Entropy Loss for pixel and vector value mismatches

9.1.5 SoftMax

9.1.6 SoftMax Cross Entropy Loss

9.1.7 Focal Loss

9.1.8 Hinge Loss

9.2 Optimization

9.2.1 Geometrical view of Optimization

9.2.2 SGD: Stochastic Gradient Descent and mini batches

9.2.3 PyTorch code for Stochastic Gradient Descent

9.2.4 Momentum

9.2.5 Geometric View: constant loss contours, gradient descent and momentum

9.2.6 NAG: Nesterov Accelerated Gradients

9.2.7 AdaGrad

9.2.8 RMSProp

9.2.9 Adam Optimizer

9.3 Regularization

9.3.1 MDL: Minimum Description Length - an Occam's Razor View of optimization

9.3.2 L2 Regularization

9.3.3 L1 Regularization

9.3.4 Sparsity: L1 vs L2 Regularization

9.3.5 Bayes Theorem and Stochastic view of optimization

9.3.6 Dropout

Chapter Summary