
5 Modern Training Techniques

 

This chapter covers

  • Improving “long term” training using a learning rate schedule.
  • Improving “short term” training using different optimizers.
  • Combining learning rate schedules and optimizers to improve deep models’ results.
  • Tuning your network’s hyper-parameters with Optuna.

At this point we have learned the basics of neural networks and three different types of architectures: fully-connected, convolutional, and recurrent. All of these networks have been trained with an approach called stochastic gradient descent (SGD), which has been in use since the 1960s and even earlier. Newer improvements to how we learn the parameters of our networks, such as momentum and learning rate decay, have been invented since then; they can improve nearly any neural network on nearly any problem by converging to better solutions in fewer updates. In this chapter we will learn about some of the most successful and widely used variants of SGD in deep learning.
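
To preview how these pieces fit together, here is a minimal sketch (not a listing from this chapter) of an ordinary PyTorch training loop that swaps plain SGD for SGD with momentum and adds an exponential learning rate schedule. The model, data, and hyper-parameter values are placeholders chosen only to make the sketch runnable.

import torch
import torch.nn as nn

# Placeholder model and data, used only for illustration.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
X, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
loss_func = nn.CrossEntropyLoss()

# Plain SGD becomes SGD with momentum by passing the momentum argument.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# A schedule shrinks the learning rate over the "long term" of training.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(10):
    optimizer.zero_grad()          # clear gradients from the previous update
    loss = loss_func(model(X), y)  # forward pass and loss
    loss.backward()                # back-propagate to get new gradients
    optimizer.step()               # "short term": one parameter update
    scheduler.step()               # "long term": decay the learning rate

The rest of the chapter looks at each of these ingredients in turn: why a learning rate schedule helps, which schedules to choose from, what optimizers like momentum and Adam do with the gradient, and how to tune the resulting hyper-parameters automatically.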

5.1 Gradient Descent in Two Parts

5.1.1 Adding a Learning Rate Schedule

5.1.2 Adding an Optimizer

5.1.3 Implementing Optimizers and Schedulers

5.2 Different Learning Rate Schedules

5.2.1 Exponential Decay: Smoothing Erratic Training

5.2.2 Step Drop Adjustment: Better Smoothing

5.2.3 Cosine Annealing: Greater Accuracy but Less Stability

5.2.4 Validation Plateau: Data-Based Adjustments

5.2.5 Comparing the Schedules

5.3 Making Better Use of Gradients

5.3.1 SGD with Momentum: Adapting to Gradient Consistency

5.3.2 Adam: Adding Variance to Momentum

5.3.3 Gradient Clipping: Avoiding Exploding Gradients

5.4 Hyper-parameter Optimization with Optuna

5.4.1 Optuna

5.4.2 Optuna with PyTorch

5.4.3 Pruning Trials with Optuna

5.5 Exercises

5.6 Summary