5 Modern training techniques


This chapter covers

  • Improving long-term training using a learning rate schedule
  • Improving short-term training using optimizers
  • Combining learning rate schedules and optimizers to improve any deep model’s results
  • Tuning network hyperparameters with Optuna

At this point, we have learned the basics of neural networks and three types of architectures: fully connected, convolutional, and recurrent. We have trained these networks with an approach called stochastic gradient descent (SGD), which has been in use since at least the 1960s. Since then, improvements to how we learn a network’s parameters have been developed, such as momentum and learning rate decay, which can improve nearly any neural network on nearly any problem by converging to better solutions in fewer updates. In this chapter, we learn about some of the most successful and widely used variants of SGD in deep learning.
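
To make these ideas concrete before we dig into the details, the following sketch shows how an optimizer and a learning rate schedule fit together in a PyTorch training loop. It pairs SGD with momentum (section 5.3.1) with an exponential decay schedule (section 5.2.1) on a toy regression problem; the model, data, and hyperparameter values are placeholders chosen only for illustration, not recommendations.

import torch
from torch import nn

# Toy data and model; the shapes and sizes here are arbitrary placeholders.
X = torch.randn(256, 10)
y = torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# An optimizer (here SGD with momentum) decides how each gradient is applied;
# a scheduler (here exponential decay) shrinks the learning rate over epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    optimizer.zero_grad()        # clear gradients from the previous update
    loss = loss_fn(model(X), y)  # forward pass on the full toy batch
    loss.backward()              # backpropagate to compute new gradients
    optimizer.step()             # apply the SGD + momentum update
    scheduler.step()             # decay the learning rate once per epoch

The same pattern extends to the other optimizers and schedules covered in this chapter: most of them are drop-in replacements from torch.optim and torch.optim.lr_scheduler, so the surrounding training loop changes very little.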

5.1 Gradient descent in two parts

5.1.1  Adding a learning rate schedule

5.1.2  Adding an optimizer

5.1.3  Implementing optimizers and schedulers

5.2 Learning rate schedules

5.2.1  Exponential decay: Smoothing erratic training

5.2.2  Step drop adjustment: Better smoothing

5.2.3  Cosine annealing: Greater accuracy but less stability

5.2.4  Validation plateau: Data-based adjustments

5.2.5  Comparing the schedules

5.3 Making better use of gradients

5.3.1  SGD with momentum: Adapting to gradient consistency

5.3.2  Adam: Adding variance to momentum

5.3.3  Gradient clipping: Avoiding exploding gradients

5.4 Hyperparameter optimization with Optuna

5.4.1  Optuna

5.4.2  Optuna with PyTorch

5.4.3  Pruning trials with Optuna

Exercises

Summary