Large Scale Distributed Acoustic Modeling With Back-off N-grams
Venue: ICSI, Berkeley, California (2013)
Publication Year: 2013
Authors: Ciprian Chelba, Peng Xu, Fernando Pereira, Thomas Richardson
Abstract
Google Voice Search is an application that provides a data-rich setup for both
language and acoustic modeling research. Our approach revives an older style of
acoustic modeling that borrows from n-gram language modeling in an attempt to
scale up both the amount of training data and the model size (as measured by
the number of parameters in the model) to approximately 100 times larger than
the sizes currently used in automatic speech recognition. Speech recognition
experiments are carried out in an N-best list rescoring framework for Google
Voice Search. We use 87,000 hours of training data (speech along with
transcriptions) obtained by filtering utterances in Voice Search logs on
automatic speech recognition confidence. Models ranging in size from 20 to 40
million Gaussians are estimated using maximum likelihood training. They achieve
relative reductions in word error rate of 11% and 6% when combined with
first-pass models trained using maximum likelihood and boosted maximum mutual
information, respectively. Increasing the context size beyond five phones
(quinphones) does not help.
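
The back-off idea named in the abstract parallels back-off n-gram language
models: a context-dependent phone model is used when its full phonetic context
was seen often enough in training, and otherwise the model backs off to a
shorter context. The Python sketch below is a hypothetical illustration of that
lookup, not the paper's implementation; backoff_lookup, the models dictionary,
and the symmetric context shrinking are all assumptions made for exposition,
and the paper's actual models are distributed maximum-likelihood-trained
Gaussian mixtures.

    # Hedged sketch of back-off context lookup for context-dependent
    # phone models, in the spirit of a back-off n-gram LM: try the widest
    # phonetic context first (quinphone: two phones on each side), then
    # shrink toward the context-independent phone. All names here are
    # illustrative, not from the paper.

    def backoff_lookup(models, phone, left, right, max_width=2):
        """Return the model for the widest context of `phone` present in
        `models`. `left`/`right` are lists of neighboring phones; `models`
        maps (left_context, phone, right_context) keys to model IDs and is
        assumed to be populated from training counts."""
        for width in range(max_width, -1, -1):
            key = (tuple(left[max(0, len(left) - width):]),
                   phone,
                   tuple(right[:width]))
            if key in models:
                return models[key]
        raise KeyError(f"no model trained for phone {phone!r}")

    # Toy example: a quinphone model exists for one context; an unseen
    # context backs off to the triphone, then monophone, model.
    models = {
        (("v", "oy"), "iy", ("s", "er")): "gmm_quinphone_123",
        (("oy",), "iy", ("s",)): "gmm_triphone_45",
        ((), "iy", ()): "gmm_monophone_7",
    }
    print(backoff_lookup(models, "iy", ["v", "oy"], ["s", "er"]))  # quinphone hit
    print(backoff_lookup(models, "iy", ["b", "oy"], ["s", "ah"]))  # backs off to triphone

Under this reading, the abstract's finding that contexts beyond quinphones do
not help corresponds to max_width values above 2 adding parameters without
reducing word error rate.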
