Train faster, generalize better: Stability of stochastic gradient descent
Abstract
We show that any model trained by a stochastic gradient method with few iterations
has vanishing generalization error. Our results apply to both convex and non-convex
optimization under standard Lipschitz and smoothness assumptions. Our bounds hold
in cases where existing uniform convergence bounds do not apply, for instance, when
there is no explicit regularization and the model capacity far exceeds the sample
size. Conceptually, our findings help explain the widely observed empirical success
of training large models with gradient descent methods. They further underscore the
importance of reducing training time beyond its obvious benefit.
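
For concreteness, here is a minimal sketch of the setting the abstract refers to, with notation chosen here for illustration (the paper's own symbols may differ): a stochastic gradient method updates the parameters using one randomly sampled training example per step, and the generalization error is the gap between population and empirical risk of the final iterate.

\[
  w_{t+1} \;=\; w_t - \alpha_t \, \nabla_w f(w_t; z_{i_t}),
  \qquad i_t \sim \mathrm{Uniform}\{1, \dots, n\},
\]
where f(w; z) is the loss of parameters w on example z and \alpha_t is the step size at iteration t. Writing R(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w; z)] for the population risk and R_S(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i) for the empirical risk on the sample S = (z_1, \dots, z_n), the generalization error of the output w_T is
\[
  \epsilon_{\mathrm{gen}} \;=\; \mathbb{E}\big[\, R(w_T) - R_S(w_T) \,\big],
\]
with the expectation taken over the draw of S and the algorithm's randomness. "Vanishing generalization error" means that this gap is small when the number of update steps T is kept small, which is the quantity the stability-based bounds control.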
