Investigating the learning dynamics of deep neural networks using random matrix theory
NIPS 2017 (to appear)
Jeffrey Pennington, Sam Schoenholz, Surya Ganguli
It has long been known that the learning dynamics of deep neural networks are intimately tied to the singular values of the network's input-output Jacobian matrix. If this matrix is poorly conditioned, training may be plagued by pathologies such as vanishing or exploding gradients. Careful initialization of the network parameters can help avoid these pathologies. For example, it has been shown that initializing with random orthogonal weight matrices can lead to dramatic improvements in training deep linear networks. For the nonlinear networks used in practice, the benefit of such initialization strategies is less clear. In this work, we use random matrix theory to study the initial singular value distribution of the Jacobian of nonlinear neural networks. We find that the benefit of orthogonal initialization is negligible for rectified linear networks but substantial for $\tanh$ networks. We provide a rule of thumb for initializing $\tanh$ networks that approximately equilibrates the singular values of the Jacobian matrix, thereby enabling a kind of dynamical isometry over the network's full depth. Finally, we perform a battery of experiments on MNIST and CIFAR10 that provide strong evidence that our theoretical analysis translates into practical improvements in training speed.
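As a rough illustration of the quantity studied here, the sketch below builds the input-output Jacobian of a random $\tanh$ network layer by layer and compares its singular value spread under Gaussian versus orthogonal weight initialization. This is a minimal numerical experiment under assumed settings (no biases, a hypothetical weight scale `sigma_w` near criticality, and arbitrary width/depth), not the paper's exact prescription or rule of thumb.

```python
import numpy as np

def random_orthogonal(n, rng):
    # A Haar-random orthogonal matrix via QR of a Gaussian matrix,
    # with column signs fixed by the diagonal of R.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def jacobian_singular_values(depth=20, width=256, sigma_w=1.05,
                             orthogonal=True, seed=0):
    """Singular values of the input-output Jacobian of a random tanh network.

    The Jacobian of x -> tanh(W_L ... tanh(W_1 x)) at a random input is the
    product of per-layer factors D_l W_l, where D_l is the diagonal matrix
    of tanh derivatives at that layer's pre-activations.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        if orthogonal:
            W = sigma_w * random_orthogonal(width, rng)
        else:
            # Gaussian entries with variance sigma_w**2 / width
            W = sigma_w * rng.standard_normal((width, width)) / np.sqrt(width)
        h = W @ x                            # pre-activations
        D = np.diag(1.0 - np.tanh(h) ** 2)   # derivative of tanh
        J = D @ W @ J                        # chain rule, one layer at a time
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

if __name__ == "__main__":
    for orth in (False, True):
        s = jacobian_singular_values(orthogonal=orth)
        print(f"orthogonal={orth}: max={s.max():.3f}, "
              f"min={s.min():.3e}, mean={s.mean():.3f}")
```

With settings like these, the orthogonal case typically shows a noticeably tighter singular value distribution than the Gaussian case, which is the qualitative effect the abstract refers to; the precise critical choice of weight (and bias) variances that achieves dynamical isometry is derived in the paper itself.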