Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search
Venue
Mobile Speech and Advanced Natural Language Solutions, Springer Science+Business Media, New York (2013), pp. 197-229
Publication Year
2013
Authors
Ciprian Chelba, Johan Schalkwyk
Abstract
Mobile is poised to become the predominant platform over which people access
the World Wide Web. Recent developments in speech recognition and understanding,
backed by high-bandwidth coverage and high-quality speech signal acquisition on
smartphones and tablets, present users with the choice of speaking their
web search queries instead of typing them. A critical component of a speech
recognition system targeting web search is the language model. The chapter presents
an empirical exploration of the google.com query stream with the end goal of high
quality statistical language modeling for mobile voice search. Our experiments show
that after text normalization the query stream is not as "wild" as it seems at
first sight. One can achieve out-of-vocabulary rates below 1% using a one-million-word
vocabulary, and excellent n-gram hit ratios of 88%/77% even at high orders such
as n = 4/5, respectively. A more careful analysis shows that a significantly larger
vocabulary (approx. 10 million words) may be required to guarantee at most 1%
out-of-vocabulary rate for a large percentage (95%) of users. Using large scale,
distributed language models can improve performance significantly: up to 10%
relative reduction in word error rate over conventional models used in speech
recognition. We also find that the query stream is non-stationary, which means that
adding more past training data beyond a certain point provides diminishing returns,
and may even degrade performance slightly. Perhaps less surprisingly, we show
that locale matters significantly for English query data across the USA, Great Britain,
and Australia. In an attempt to leverage the speech data in voice search logs, we
successfully build large-scale discriminative n-gram language models and obtain
small but significant gains in recognition performance.
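The two coverage metrics the abstract quotes, out-of-vocabulary (OOV) rate and n-gram hit ratio, can be illustrated with a minimal sketch. This is not the chapter's pipeline, only a toy computation over hypothetical query tokens; the function names and example data are assumptions for illustration.

```python
def oov_rate(tokens, vocab):
    """Fraction of running tokens not covered by the vocabulary."""
    if not tokens:
        return 0.0
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / len(tokens)

def ngram_hit_ratio(tokens, model_ngrams, n):
    """Fraction of order-n n-grams in the stream found in the model's n-gram set."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in model_ngrams)
    return hits / len(grams)

# Toy "training" and "test" query streams (hypothetical data).
train = "weather in new york city".split()
test = "weather in new orleans".split()
vocab = set(train)
bigrams = {tuple(train[i:i + 2]) for i in range(len(train) - 1)}

print(oov_rate(test, vocab))               # "orleans" is the single OOV token
print(ngram_hit_ratio(test, bigrams, 2))   # 2 of 3 test bigrams seen in training
```

On real query logs the same counts are simply accumulated over the whole stream, which is how figures such as "below 1% OOV with a one-million-word vocabulary" are reported.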
