Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
NIPS 2017 (to appear)
Jeffrey Pennington, Sam Schoenholz, Surya Ganguli
It is well known that weight initializations in deep networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients. Moreover, in deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1, a property known as dynamical isometry, can yield a dramatic additional speed-up in learning. However, it is unclear how to achieve dynamical isometry in nonlinear deep networks. We address this question by employing powerful tools from free probability theory to analytically compute the entire singular value distribution of a deep network's input-output Jacobian as a function of depth, weight initialization, and nonlinearity. Intriguingly, we find that ReLU networks can never achieve such isometry, regardless of the weight initialization, whereas sigmoidal networks can achieve isometry, but only with orthogonal weight initializations, not Gaussian ones. Moreover, we demonstrate empirically that nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values, and not just its second moment, is an important design consideration in deep learning, and satisfying this design consideration enables sigmoidal networks to outperform ReLU networks.
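The central object in the analysis is the network's input-output Jacobian, J = D_L W_L ··· D_1 W_1, where W_l is the weight matrix of layer l and D_l is the diagonal matrix of the nonlinearity's derivatives at that layer's pre-activations. As a minimal numerical sketch of the phenomenon (not the paper's code; the function names, the bias-free tanh network, and the choice sigma_w = 1 are assumptions made here for illustration), the following compares the spread of Jacobian singular values under orthogonal versus Gaussian initialization:

```python
import numpy as np

def orthogonal(n, rng):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs so Q is Haar-distributed

def jacobian_singular_values(depth, width, init, rng, sigma_w=1.0):
    """Singular values of the input-output Jacobian J = D_L W_L ... D_1 W_1
    of a bias-free deep tanh network, evaluated at a random input.
    (Illustrative setup assumed here, not the paper's exact experiment.)"""
    x = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W = sigma_w * orthogonal(width, rng)
        else:  # Gaussian init: entries ~ N(0, sigma_w^2 / width)
            W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        h = W @ x
        D = np.diag(1.0 - np.tanh(h) ** 2)  # derivative of tanh at pre-activations
        J = D @ W @ J
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
for init in ("orthogonal", "gaussian"):
    sv = jacobian_singular_values(depth=100, width=256, init=init, rng=rng)
    print(f"{init:>10}: max sv = {sv.max():.3f}, min sv = {sv.min():.3e}")
```

Per the abstract's claim, as depth grows the orthogonally initialized spectrum should stay far more concentrated near 1 than the Gaussian one, whose singular values spread out even when their mean square is kept O(1).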