Learning N-gram Language Models from Uncertain Data
Venue
Interspeech (2016)
Publication Year
2016
Authors
Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, Brian Roark
Abstract
We present a new algorithm for efficiently training n-gram language models on
uncertain data, and illustrate its use for semi-supervised language model
adaptation. We compute the probability that an n-gram occurs k times in the sample
of uncertain data, and use the resulting histograms to derive a generalized Katz
backoff model. We compare semi-supervised adaptation of language models for YouTube
video speech recognition in two conditions: when using full lattices with our new
algorithm versus just the 1-best output from the baseline speech recognizer. Unlike
1-best methods, the new algorithm provides models that yield solid improvements
over the baseline on the full test set, and, further, achieves these gains without
hurting performance on any of the channels in the set. We show that the channels with
the most data yield the largest gains. The algorithm was implemented via a new
semiring in the OpenFst library and will be released as part of the OpenGrm ngram
library.
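
To make the count-histogram idea concrete, the following is a minimal, hypothetical Python sketch, not the paper's implementation (which uses a new OpenFst semiring). It assumes each uncertain utterance contributes an independent probability that a given n-gram occurs in it (at most once per utterance, a simplification); under that assumption the probability that the n-gram occurs exactly k times follows a Poisson-binomial distribution computable by dynamic programming, and summing these distributions over n-grams gives expected count-of-counts histograms of the kind used in Katz-style discounting. All function names and the toy probabilities are illustrative.

```python
from collections import defaultdict

def count_distribution(probs):
    """P(n-gram occurs exactly k times), assuming independent occurrences
    with probability probs[i] in utterance i (Poisson-binomial DP)."""
    dist = [1.0]  # start with P(count = 0) = 1 before seeing any utterance
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)   # n-gram absent from this utterance
            new[k + 1] += q * p       # n-gram present in this utterance
        dist = new
    return dist

def expected_count_of_counts(ngram_probs):
    """Expected number of n-grams occurring exactly k times, E[n_k],
    by linearity of expectation over the per-n-gram count distributions.
    (The k = 0 mass is typically ignored when discounting.)"""
    hist = defaultdict(float)
    for probs in ngram_probs.values():
        for k, pk in enumerate(count_distribution(probs)):
            hist[k] += pk
    return dict(hist)

# Toy example: two bigrams with posterior occurrence probabilities taken
# from three uncertain utterances each.
ngram_probs = {
    ("the", "cat"): [0.9, 0.6, 0.2],
    ("cat", "sat"): [0.5, 0.5, 0.1],
}
print(expected_count_of_counts(ngram_probs))
```

With exact (0/1) counts this reduces to the ordinary count-of-counts histogram, so a generalized Katz backoff built from these expected histograms falls back to standard Katz on certain data.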
