Language Modeling in the Era of Abundant Data
Abstract
The talk presents an overview of statistical language modeling as applied to
real-world problems: speech recognition, machine translation, spelling correction,
and soft keyboards, to name a few prominent ones. We summarize the most successful
estimation techniques, and examine how they fare for applications with abundant
data, e.g. voice search. We conclude by highlighting a few open problems: getting
an accurate estimate for the entropy of text produced by a very specific source
(e.g. a query stream); optimally leveraging data of varying degrees of relevance
to a given "domain"; and determining whether a bound on the size of a "good" model
for a given source exists.
