An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition
Abstract
Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant, 'AdaDec', that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates.
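As a rough illustration of the decoupling mentioned above, the sketch below contrasts plain AdaGrad scaling with an AdaDec-style update in which the per-parameter squared-gradient statistic is exponentially decayed and the long-term schedule is supplied as a separate global factor. The function names, decay constants, and the exponential global schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def adagrad_lr(eta0, grad, accum, eps=1e-8):
    """Plain AdaGrad: the squared-gradient sum grows without bound, so it
    provides both per-parameter scaling and an implicit long-term decay."""
    accum = accum + grad ** 2
    return eta0 / np.sqrt(accum + eps), accum

def adadec_style_lr(eta_global, grad, accum, gamma=0.999, eps=1e-8):
    """AdaDec-style variant (illustrative): an exponentially decayed
    accumulator gives per-parameter scaling only; the long-term schedule
    is carried separately by eta_global."""
    accum = gamma * accum + grad ** 2
    return eta_global / np.sqrt(accum + eps), accum

def global_schedule(eta0, step, half_life=100000.0):
    """Hypothetical long-term schedule: exponential decay with a fixed half-life."""
    return eta0 * 0.5 ** (step / half_life)

# Usage sketch with a stand-in minibatch gradient.
theta = np.zeros(10)
accum = np.zeros_like(theta)
for step in range(1, 1001):
    grad = 0.1 * np.random.randn(10)
    eta_t = global_schedule(0.04, step)          # long-term scheduling
    lr, accum = adadec_style_lr(eta_t, grad, accum)  # per-parameter variation
    theta -= lr * grad
```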