5 Modern Training Techniques

 

This chapter covers

  • Improving “long-term” training using a learning rate schedule.
  • Improving “short-term” training using different optimizers.
  • Combining learning rate schedules and optimizers to improve a deep model’s results.
  • Tuning your network’s hyperparameters with Optuna.

At this point we have learned the basics of neural networks and three different types of architectures: fully-connected, convolutional, and recurrent. All of these networks have been trained with an approach called stochastic gradient descent (SGD), a method whose roots go back to the 1950s. Many newer techniques for learning a network’s parameters have been developed since then, and they can improve the training of nearly any neural network on nearly any problem.
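Before looking at those techniques, it helps to see what a single SGD step actually does. The following is a minimal sketch (a toy example, not code from this chapter) of one update in PyTorch: compute the loss on a mini-batch, back-propagate, and nudge every parameter a small step against its gradient. The model, data, and learning rate are placeholders.

import torch

# Hypothetical one-step SGD sketch: theta <- theta - lr * gradient of the loss.
model = torch.nn.Linear(10, 1)       # a stand-in for any network
loss_fn = torch.nn.MSELoss()
lr = 0.01                            # the learning rate (step size)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # one random mini-batch

loss = loss_fn(model(x), y)          # forward pass
loss.backward()                      # compute gradients for every parameter

with torch.no_grad():                # the raw SGD update, applied by hand
    for p in model.parameters():
        p -= lr * p.grad             # step against the gradient
        p.grad.zero_()               # clear gradients before the next batch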

5.1 Gradient Descent in Two Parts

 
 
 
 

5.1.1 Implementing Optimizers and Schedulers
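As a point of reference, here is a hedged sketch of where a PyTorch optimizer and a learning rate scheduler sit in an ordinary training loop. The choice of SGD with an ExponentialLR schedule, and the names train_loader and loss_fn, are illustrative assumptions rather than this book's exact code.

import torch

def train_with_schedule(model, train_loader, loss_fn, epochs=10):
    # The optimizer decides how each batch's gradient becomes a weight update;
    # the scheduler decides how the learning rate changes across the epochs.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()          # clear gradients from the last batch
            loss = loss_fn(model(x), y)    # forward pass
            loss.backward()                # back-propagate
            optimizer.step()               # optimizer updates the weights
        scheduler.step()                   # scheduler adjusts the learning rate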

 

5.2 Different Learning Rate Schedules

 
 

5.2.1 Exponential Decay
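As one concrete (assumed) illustration using PyTorch's built-in ExponentialLR: the learning rate is multiplied by a fixed factor gamma after every epoch, so after t epochs it equals the initial rate times gamma to the power t. The model and the numbers below are placeholders.

import torch

model = torch.nn.Linear(2, 1)                              # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    # ... one epoch of training would go here (optimizer.step(), etc.) ...
    scheduler.step()                       # lr becomes 0.1 * 0.95 ** (epoch + 1)
    print(epoch, scheduler.get_last_lr())  # watch the rate shrink geometrically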

 

5.2.2 Step Drop Adjustment
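A sketch of a step-drop schedule using PyTorch's StepLR: the rate is held constant and then cut by a factor of gamma every step_size epochs. The specific numbers here are placeholders.

import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Every 10 epochs, multiply the learning rate by 0.1 (a tenfold drop).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... training for one epoch ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # 0.1 -> 0.01 -> 0.001 in plateaus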

 
 
 
 

5.2.3 Cosine Annealing
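A sketch using PyTorch's CosineAnnealingLR, which lowers the rate from its starting value toward eta_min along half a cosine curve over T_max epochs; the values used here are placeholders.

import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Decay from 0.1 down to eta_min over T_max epochs along a cosine curve.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20, eta_min=0.001)

for epoch in range(20):
    # ... training for one epoch ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # slow change at start and end, fastest in the middle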

 

5.2.4 Validation Plateau
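A sketch of the plateau approach with PyTorch's ReduceLROnPlateau: rather than following a fixed curve, the rate is reduced only when a monitored metric (here a stand-in validation loss on random data) stops improving. Note that this scheduler's step() call takes the metric as an argument.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)    # stand-in validation set

# Halve the learning rate when validation loss has not improved for 3 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(30):
    # ... train for one epoch here ...
    with torch.no_grad():                      # measure validation loss
        val_loss = loss_fn(model(X_val), y_val).item()
    scheduler.step(val_loss)                   # this scheduler needs the metric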

 
 
 

5.2.5 Comparing the Schedules
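One simple way to compare the shapes of these schedules (an assumed helper, not from the book) is to attach each one to a throwaway optimizer and record the learning rate it would produce at every epoch, without doing any real training.

import torch

def lr_curve(make_scheduler, epochs=30, lr=0.1):
    """Record the learning rate an optimizer would see at each epoch."""
    opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=lr)
    sched = make_scheduler(opt)
    rates = []
    for _ in range(epochs):
        rates.append(opt.param_groups[0]["lr"])   # read the current rate
        sched.step()                              # advance the schedule (no training)
    return rates

curves = {
    "exponential": lr_curve(lambda o: torch.optim.lr_scheduler.ExponentialLR(o, gamma=0.9)),
    "step drop":   lr_curve(lambda o: torch.optim.lr_scheduler.StepLR(o, step_size=10, gamma=0.1)),
    "cosine":      lr_curve(lambda o: torch.optim.lr_scheduler.CosineAnnealingLR(o, T_max=30)),
}
for name, rates in curves.items():
    print(name, [round(r, 4) for r in rates[:5]], "...")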

 
 
 

5.3 Making Better Use of Gradients

 
 
 
 

5.3.1 SGD with Momentum
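A sketch of enabling momentum in PyTorch, assuming the standard torch.optim.SGD optimizer; momentum=0.9 is a common choice but only a placeholder here.

import torch

model = torch.nn.Linear(10, 1)   # stand-in model

# Plain SGD versus SGD with momentum: the only change is the momentum argument.
plain_sgd = torch.optim.SGD(model.parameters(), lr=0.01)
momentum_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Conceptually, with momentum the update is roughly:
#   velocity = momentum * velocity + gradient
#   weight   = weight - lr * velocity
# so consistent gradient directions accelerate and noisy ones partly cancel out.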

 
 
 

5.3.2 Adam
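For Adam, a minimal sketch with torch.optim.Adam; the lr and betas values shown are the library defaults, written out only to show where the knobs are.

import torch

model = torch.nn.Linear(10, 1)

# Adam keeps a running mean of gradients (beta1) and of squared gradients
# (beta2), and scales each parameter's step by its own gradient history.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# It drops into the usual loop exactly like SGD:
#   optimizer.zero_grad(); loss.backward(); optimizer.step()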

 
 
 

5.3.3 Gradient Clipping
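A sketch of gradient clipping with torch.nn.utils.clip_grad_norm_: the clipping call sits between loss.backward() and optimizer.step(), so oversized gradients are rescaled before they reach the weights. The max_norm of 1.0 and the toy model and batch are placeholders.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                                   # gradients computed
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if too large
optimizer.step()                                                  # update with clipped gradients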

 

5.4 Hyperparameter Optimization

 
 
 
 

5.4.1 Optuna
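A minimal, hedged Optuna example (not tied to any network yet): define an objective function that takes a trial, let the trial suggest hyperparameter values, and return the score to minimize.

import optuna

# Toy problem: find the x that minimizes (x - 2)^2.
def objective(trial):
    x = trial.suggest_float("x", -10.0, 10.0)   # Optuna proposes a value for x
    return (x - 2.0) ** 2                       # the score Optuna tries to minimize

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)                        # should be close to {"x": 2.0}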

 
 
 
 

5.4.2 Optuna with PyTorch
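A sketch of wiring Optuna to a small PyTorch training loop; the tiny random dataset, the two tuned hyperparameters (lr and momentum), and the 20-epoch budget are all illustrative assumptions.

import optuna
import torch

X = torch.randn(256, 10)      # stand-in training data
y = torch.randn(256, 1)

def objective(trial):
    # Let Optuna pick the hyperparameters for this trial.
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    momentum = trial.suggest_float("momentum", 0.0, 0.99)

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(20):               # a short training run per trial
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()                    # Optuna minimizes the final loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)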

 
 

5.4.3 Pruning Trials with Optuna
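A sketch of pruning, assuming Optuna's MedianPruner: the objective reports an intermediate score every epoch and raises optuna.TrialPruned when the pruner decides the trial is unpromising, so bad hyperparameter choices stop early. Model and data are toy placeholders.

import optuna
import torch

X, y = torch.randn(256, 10), torch.randn(256, 1)

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()

        trial.report(loss.item(), epoch)      # intermediate value for this epoch
        if trial.should_prune():              # pruner says this trial looks hopeless
            raise optuna.TrialPruned()
    return loss.item()

study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)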

 

5.5 Exercises

 
 
 

5.6 Summary

 
 
 
 