8 Regularization via Optimization

This chapter covers

  • The implicit regularization effect of SGD and recent theoretical analysis of SGD's mean iterate via gradient flow
  • SGD convergence in expectation
  • Popular SGD variants such as momentum and Adam, which do not necessarily generalize better than vanilla SGD
  • Theoretical analysis of SGD’s convergence behavior in a univariate linear regression setting

The regularization techniques we have covered so far, regularization via the data, the model, and the objective function, all require additional interventions in the model training process to achieve a regularization effect. For example, we can regularize an overfitted model by augmenting the training data, adding a dropout layer to the neural network, or penalizing the magnitude of the weights in the objective function. However, the optimization algorithm, the last component of the model training process and the topic of this chapter, can introduce an implicit regularization effect without any additional treatment. In other words, we reap the benefits of regularization simply by optimizing the model as usual.
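
To make this concrete, here is a minimal sketch of plain SGD on a univariate linear regression problem, the setting analyzed in section 8.2.1. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values from this chapter; the point is that the update rule contains no explicit penalty term, so any regularization effect must come from the optimization dynamics alone.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic univariate data: y = 2x + Gaussian noise (illustrative values)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)

w = 0.0        # single weight; no bias term, for simplicity
lr = 0.1       # learning rate
epochs = 20

for epoch in range(epochs):
    for i in rng.permutation(n):                # one randomly chosen sample per step
        grad = 2.0 * (w * x[i] - y[i]) * x[i]   # gradient of the squared error (w*x - y)^2
        w -= lr * grad                          # vanilla SGD update, no penalty term

print(f"learned w = {w:.3f}")                   # approaches the true slope of 2.0

Each epoch visits every sample once in a random order, the shuffled (without-replacement) variant of SGD commonly used in practice.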

8.1 Stochastic optimization

8.1.1 Empirical risk minimization via gradient descent

8.1.2 Convergence of SGD

8.1.3 Implicit regularization of SGD

8.1.4 Analyzing the mean iterate

8.1.5 SGD variants: better or worse?

8.2 More on SGD convergence

8.2.1 SGD in univariate linear regression

8.2.2 SGD’s convergence in expectation

8.2.3 SGD: past, present, and future

8.3 Summary