8 Regularization via Optimization
This chapter covers
- The implicit regularization effect of SGD, along with recent theoretical analyses based on the mean iterates of SGD gradient flow
- SGD convergence in expectation
- Popular SGD variants such as momentum and Adam, which do not necessarily generalize better than vanilla SGD
- Theoretical analysis of SGD’s convergence behavior in a univariate linear regression setting
The regularization techniques we have covered so far, namely regularization via the data, the model, and the objective function, all require additional treatments in the model training process to achieve a regularization effect. For example, we can regularize an overfit model by augmenting the training data, adding a dropout layer to the neural network, or penalizing the magnitude of the weights in the objective function. However, the optimization algorithm, the last component of the model training process and the topic of this chapter, can introduce an implicit regularization effect without any such additional treatment. In other words, we reap the benefits of regularization simply by optimizing the model as usual.
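To make "optimizing the model as usual" concrete, the following is a minimal sketch of vanilla SGD applied to the univariate linear regression setting mentioned above. The data-generating slope (`true_w = 2.0`), noise level, learning rate, and epoch count are all illustrative choices, not values taken from this chapter; the point is only that each update uses the gradient of the loss on a single sample.

```python
import random

random.seed(0)

# Synthetic univariate linear regression data: y = true_w * x + noise.
# true_w and the noise scale are arbitrary choices for illustration.
true_w = 2.0
data = [(x, true_w * x + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(200))]

# Vanilla SGD on the squared loss (y - w*x)^2, one sample per step
w = 0.0
lr = 0.1
for epoch in range(20):
    random.shuffle(data)  # a fresh random sample order each epoch
    for x, y in data:
        grad = -2 * (y - w * x) * x  # d/dw of (y - w*x)^2
        w -= lr * grad

print(w)  # close to true_w, up to noise from the stochastic updates
```

Note that nothing here looks like a regularizer: no data augmentation, no dropout, no penalty term in the loss. The implicit regularization discussed in this chapter comes from the stochasticity of the updates themselves.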