A Bayesian Perspective on Generalization and Stochastic Gradient Descent

ICLR (2018)

Abstract

Zhang et al. (2016) argued that understanding deep learning requires rethinking generalization. To justify this claim, they showed that deep networks can easily memorize randomly labeled training data, despite generalizing well when shown the real labels of the same inputs. We show here that the same phenomenon occurs in small linear models with fewer than a thousand parameters; however, there is no need to rethink anything, since our observations are explained by evaluating the Bayesian evidence in favor of each model. This Bayesian evidence penalizes sharp minima. We also explore the "generalization gap" observed between small-batch and large-batch training, identifying an optimum batch size which scales linearly with both the learning rate and the size of the training set. Surprisingly, in our experiments the generalization gap was closed by regularizing the model.
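
As a rough sketch of why the evidence penalizes sharp minima (this derivation and its notation are illustrative, not quoted from the abstract), the Bayesian evidence for a model $M$ with $p$ parameters can be written via the standard Laplace approximation around a minimum $\omega_0$ of the negative log posterior:

\[
P(D \mid M) \;\approx\; P(D \mid \omega_0, M)\, P(\omega_0 \mid M)\, (2\pi)^{p/2} \det(H)^{-1/2},
\]

where $H$ denotes the Hessian of the negative log posterior at $\omega_0$. A sharp minimum has large Hessian eigenvalues, so $\det(H)$ is large and the evidence is suppressed, whereas a broad minimum is favored. The batch-size result stated above can likewise be summarized as $B_{\mathrm{opt}} \propto \epsilon N$, with $\epsilon$ the learning rate and $N$ the training-set size.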