Random Walk Initialization for Training Very Deep Feedforward Networks
Venue
arXiv preprint (2014), pp. 1-10
Publication Year
2014
Authors
David Sussillo, L.F. Abbott
Abstract
Training very deep networks is an important open problem in machine learning. One
of many difficulties is that the norm of the back-propagated error gradient can
grow or decay exponentially. Here we show that training very deep feed-forward
networks (FFNs) is not as difficult as previously thought. Unlike when
back-propagation is applied to a recurrent network, application to an FFN amounts
to multiplying the error gradient by a different random matrix at each layer. We
show that the successive application of correctly scaled random matrices to an
initial vector results in a random walk of the log of the norm of the resulting
vectors, and we compute the scaling that makes this walk unbiased. The variance of
the random walk grows only linearly with network depth and is inversely
proportional to the size of each layer. Practically, this implies a gradient whose
log-norm scales with the square root of the network depth and shows that the
vanishing gradient problem can be mitigated by increasing the width of the layers.
Mathematical analyses and experimental results using stochastic gradient descent to
optimize tasks related to the MNIST and TIMIT datasets are provided to support
these claims. Equations for the optimal matrix scaling are provided for the linear
and ReLU cases.
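
The abstract's central claim can be checked with a short simulation. The sketch below (Python/NumPy; not code from the paper) multiplies a unit vector by a fresh random matrix at each layer, with entries drawn i.i.d. from N(0, g^2/N), and records the log of the resulting norm. The scaling g = exp(1/(2N)) used here is the unbiased linear-case choice suggested by the analysis summarized above; treat the exact constant, and all function names, as assumptions of this sketch. The paper's separate optimal scaling for the ReLU case is not reproduced here. Under these assumptions the log-norm walk stays roughly centered at zero, and its spread at depth D is close to sqrt(D/(2N)), i.e. it grows with the square root of depth and shrinks as the layers widen.

import numpy as np

def log_norm_walk(depth, width, g, rng):
    """Return log ||z_d|| for d = 1..depth for a chain of random linear layers."""
    z = rng.standard_normal(width)
    z /= np.linalg.norm(z)                       # start from a unit vector
    logs = []
    for _ in range(depth):
        # Entries of W are i.i.d. N(0, g^2 / width).
        W = rng.standard_normal((width, width)) * (g / np.sqrt(width))
        z = W @ z
        logs.append(np.log(np.linalg.norm(z)))
    return np.array(logs)

def summarize(depth=200, width=100, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    g = np.exp(1.0 / (2.0 * width))              # assumed linear-case scaling
    walks = np.stack([log_norm_walk(depth, width, g, rng) for _ in range(trials)])
    final = walks[:, -1]                         # log-norm after the last layer
    print(f"width={width} depth={depth} g={g:.5f}")
    print(f"mean log-norm at depth D: {final.mean():+.3f} (near 0 if the walk is unbiased)")
    print(f"std  log-norm at depth D: {final.std():.3f} "
          f"(compare sqrt(D/(2N)) = {np.sqrt(depth / (2 * width)):.3f})")

if __name__ == "__main__":
    summarize()                                  # N = 100
    summarize(width=400)                         # wider layers -> smaller spread

Running both calls shows the same depth but a smaller spread of the final log-norm for the wider network, matching the abstract's statement that the variance of the walk is inversely proportional to layer size.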
