In this paper, we investigate how to optimize the vocabulary for a voice search
language model. The metric we optimize over is the out-of-vocabulary (OoV) rate
since it is a strong indicator of user experience. In a departure from the usual
way of measuring OoV rates, web search logs allow us to compute the per-session OoV
rate and thus estimate the percentage of users that experience a given OoV rate.
Under very conservative text normalization, we find that a voice search vocabulary
consisting of 2 to 2.5M words extracted from 1 week of search query data will
result in an aggregate OoV rate of 0.01; at that size, the same OoV rate will also
be experienced by 90% of users. The number of words included in the vocabulary is a
stable indicator of the OoV rate. Altering the freshness of the vocabulary or the
duration of the time window over which the training data is gathered does not
significantly change the OoV rate. Surprisingly, a much larger vocabulary
(approx. 10 million words) is required to guarantee OoV rates below 0.01 (1%) for
95% of the users.
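
To make the distinction between the aggregate and the per-session OoV rate concrete, the following is a minimal sketch, not the implementation used in this work. It assumes whitespace tokenization in place of the text normalization described here, a vocabulary built from the most frequent training words, and sessions represented as a mapping from a session identifier to its list of queries. All function names and the data layout are hypothetical.

```python
from collections import Counter


def build_vocabulary(training_queries, size):
    """Keep the `size` most frequent words (assumption: whitespace tokens)."""
    counts = Counter(word for query in training_queries for word in query.split())
    return {word for word, _ in counts.most_common(size)}


def oov_rate(queries, vocab):
    """Fraction of word tokens in `queries` that are not in `vocab`."""
    words = [word for query in queries for word in query.split()]
    if not words:
        return 0.0
    return sum(word not in vocab for word in words) / len(words)


def per_session_oov_rates(sessions, vocab):
    """OoV rate computed separately for each session (hypothetical layout:
    session id -> list of query strings)."""
    return {sid: oov_rate(queries, vocab) for sid, queries in sessions.items()}


def fraction_of_sessions_at_or_below(sessions, vocab, threshold):
    """Share of sessions whose OoV rate does not exceed `threshold`."""
    rates = per_session_oov_rates(sessions, vocab)
    if not rates:
        return 0.0
    return sum(rate <= threshold for rate in rates.values()) / len(rates)


def aggregate_oov_rate(sessions, vocab):
    """OoV rate over all word tokens pooled across sessions."""
    all_queries = [query for queries in sessions.values() for query in queries]
    return oov_rate(all_queries, vocab)
```

Under these assumptions, the aggregate rate pools all word tokens, whereas `fraction_of_sessions_at_or_below(sessions, vocab, 0.01)` estimates the share of users (approximated by sessions) who see an OoV rate of at most 0.01. The two can differ substantially, because a few sessions with many rare words have little effect on the pooled rate but each counts fully at the session level.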