Large Scale Distributed Deep Networks
Venue
NIPS (2012)
Publication Year
2012
Authors
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng
Abstract
Recent work in unsupervised feature learning and deep learning has shown that being
able to train large models can dramatically improve performance. In this paper, we
consider the problem of training a deep network with billions of parameters using
tens of thousands of CPU cores. We have developed a software framework called
DistBelief that can utilize computing clusters with thousands of machines to train
large models. Within this framework, we have developed two algorithms for
large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic
gradient descent procedure supporting a large number of model replicas, and (ii)
Sandblaster, a framework that supports a variety of distributed batch optimization
procedures, including a distributed implementation of L-BFGS. Downpour SGD and
Sandblaster L-BFGS both increase the scale and speed of deep network training. We
have successfully used our system to train a deep network 30x larger than
previously reported in the literature, achieving state-of-the-art performance on
ImageNet, a visual object recognition task with 16 million images and 21k
categories. We show that these same techniques dramatically accelerate the training
of a more modestly sized deep network for a commercial speech recognition service.
Although we focus on and report performance of these methods as applied to training
large neural networks, the underlying algorithms are applicable to any
gradient-based machine learning algorithm.
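
To make the asynchronous pattern behind Downpour SGD concrete, here is a minimal toy sketch, not the DistBelief implementation: several worker "model replicas" repeatedly pull possibly stale parameters from a shared parameter server, compute a gradient on their own data shard, and push the update back without synchronizing with one another. The class and function names (ParameterServer, worker) are illustrative; the actual system shards both the model and the parameter server across many machines rather than using threads in one process.

```python
# Toy sketch of asynchronous, Downpour-style SGD with a single in-process
# parameter server (illustrative only; names and structure are assumptions).
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; applies pushed gradients under a lock."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, X, y, steps):
    """One model replica: asynchronous fetch-compute-push loop (least squares)."""
    for _ in range(steps):
        w = ps.pull()                       # fetch possibly stale parameters
        idx = np.random.randint(len(X))     # one-example "mini-batch"
        xi, yi = X[idx], y[idx]
        grad = (xi @ w - yi) * xi           # gradient of 0.5 * (x.w - y)^2
        ps.push(grad)                       # apply update asynchronously

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(1000, 3))
    y = X @ true_w
    ps = ParameterServer(dim=3)

    # Each thread plays the role of one model replica on its own data shard.
    shards = np.array_split(np.arange(len(X)), 4)
    threads = [threading.Thread(target=worker, args=(ps, X[s], y[s], 2000))
               for s in shards]
    for t in threads: t.start()
    for t in threads: t.join()
    print("learned parameters:", ps.pull())
```

Because replicas update the shared parameters independently, gradients may be computed against slightly outdated values; the paper reports that this asynchrony works well in practice and adds fault tolerance, since a slow or failed replica does not stall the others.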
