Train faster, generalize better: Stability of stochastic gradient descent
Abstract
We show that any model trained by a stochastic gradient method with few
iterations has vanishing generalization error. Our results apply to both
convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Our bounds hold in cases where existing uniform convergence bounds do not apply, for instance, if there is no explicit form of
regularization and the model capacity far exceeds the sample size. Conceptually, our findings help explain the widely observed empirical success of training large models with gradient descent methods. They further underscore the importance of reducing training time beyond its obvious benefit of saving computation.
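The stability argument behind these bounds can be illustrated numerically: run the same stochastic gradient method, with the same sampling order, on two datasets that differ in a single example, and measure how far the resulting parameters drift apart. The sketch below is a hypothetical toy setup (a 1-D linear model with squared loss and hand-picked synthetic data, none of which come from the paper), not the paper's construction; it only illustrates that a short training run keeps the two outputs close.

```python
import random

def sgd(data, steps=50, lr=0.1, seed=0):
    """Run SGD on the squared loss for a 1-D linear model y ~ w * x."""
    rng = random.Random(seed)  # fixed seed: identical sampling order across runs
    w = 0.0
    for _ in range(steps):
        x, y = data[rng.randrange(len(data))]
        # Gradient of (w*x - y)^2 / 2 with respect to w.
        w -= lr * (w * x - y) * x
        w = max(-10.0, min(10.0, w))  # keep iterates in a bounded region
    return w

# Two datasets that differ in exactly one example (synthetic, illustrative).
S       = [(1.0, 2.0), (0.5, 1.0), (2.0, 4.0), (1.5, 3.0)]
S_prime = S[:-1] + [(1.5, 0.0)]

divergence = abs(sgd(S) - sgd(S_prime))
print(divergence)  # parameter divergence between the two neighboring runs
```

Uniform stability bounds the effect of swapping one training example on the learned parameters (and hence on the loss at any test point); the fewer the iterations, the less the two trajectories can diverge, which is the sense in which training faster generalizes better.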