Publication Data
Distributed Discriminative Language Models for Google Voice Search
Abstract: This paper considers large-scale linear discriminative
language models trained using a distributed perceptron algorithm. The algorithm is
implemented efficiently using a MapReduce/SSTable framework. This work also introduces
the use of large amounts of unsupervised data (confidence-filtered Google voice-search
logs) in conjunction with a novel training procedure that regenerates word lattices for
the given data with a weaker acoustic model than the one used to generate the
unsupervised transcriptions for the logged data. We observe small but statistically
significant improvements in recognition performance after reranking N-best lists of a
standard Google voice-search data set.
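
As a rough illustration of the kind of training the abstract describes, the sketch below trains a linear reranker over N-best lists with a perceptron whose per-shard weights are averaged after every epoch (iterative parameter mixing). The abstract does not say which distributed perceptron variant the paper uses, so the mixing scheme, the feature names, the toy data, and all function names here are assumptions made for illustration only. The "map" and "reduce" steps run in-process rather than on a MapReduce/SSTable framework.

```python
from collections import defaultdict


def score(weights, feats):
    """Linear model score: dot product of sparse weights and sparse features."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def rerank(weights, nbest):
    """Return the index of the highest-scoring hypothesis in an N-best list."""
    return max(range(len(nbest)), key=lambda i: score(weights, nbest[i]["feats"]))


def train_shard(weights, shard):
    """One perceptron epoch over a single shard (the "map" step)."""
    w = dict(weights)  # start from the shared weights, update a local copy
    for nbest in shard:
        # Oracle: the hypothesis with the fewest word errors.
        oracle = min(range(len(nbest)), key=lambda i: nbest[i]["errors"])
        pred = rerank(w, nbest)
        if nbest[pred]["errors"] > nbest[oracle]["errors"]:
            # Standard perceptron update: move toward the oracle features
            # and away from the features of the wrongly preferred hypothesis.
            for f, v in nbest[oracle]["feats"].items():
                w[f] = w.get(f, 0.0) + v
            for f, v in nbest[pred]["feats"].items():
                w[f] = w.get(f, 0.0) - v
    return w


def mix(weight_list):
    """Uniformly average the per-shard weight vectors (the "reduce" step)."""
    mixed = defaultdict(float)
    for w in weight_list:
        for f, v in w.items():
            mixed[f] += v / len(weight_list)
    return dict(mixed)


def train(shards, epochs=5):
    """Iterative parameter mixing: train each shard, average, repeat."""
    weights = {}
    for _ in range(epochs):
        weights = mix([train_shard(weights, shard) for shard in shards])
    return weights


# Hypothetical toy data: two shards, one utterance each, two hypotheses each.
shards = [
    [[{"feats": {"bad": 1.0}, "errors": 1},
      {"feats": {"good": 1.0}, "errors": 0}]],
    [[{"feats": {"bad": 1.0, "x": 1.0}, "errors": 2},
      {"feats": {"good": 1.0, "x": 1.0}, "errors": 0}]],
]
weights = train(shards, epochs=3)
```

In this toy setup, calling `rerank(weights, shards[0][0])` after training selects the lower-error hypothesis. A realistic system would use n-gram features together with the baseline recognizer score, and the number of word errors would come from an alignment against a reference (or, for unsupervised data, against the logged transcription).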
