Training Data Selection Based On Context-Dependent State Matching
Abstract
In this paper we construct a data set for semi-supervised acoustic model training
by selecting spoken utterances from a massive collection of anonymized Google Voice
Search utterances. Semi-supervised training usually retains high-confidence
utterances, which are presumed to have an accurate hypothesized transcript, a
necessary condition for successful training. Selecting high-confidence utterances
can, however, restrict the diversity of the resulting data set. We propose to
introduce a constraint enforcing that the distribution of the context-dependent
state symbols obtained by running forced alignment of the hypothesized transcript
matches a reference distribution estimated from a curated development set. The
quality of the resulting training set is illustrated in large-scale Voice Search
recognition experiments, where it outperforms random selection of high-confidence
utterances.
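
The distribution-matching constraint can be made concrete with a small sketch. The abstract does not specify the selection procedure, so everything below is an assumption made for illustration: the greedy strategy, the use of KL divergence as the mismatch measure, the smoothing constant, the confidence threshold `min_conf`, and the names `kl_divergence` and `greedy_select`. Each utterance is assumed to carry a confidence score and the list of context-dependent state symbols produced by forced alignment of its hypothesized transcript.

```python
import math
from collections import Counter


def kl_divergence(ref, counts, eps=1e-9):
    """KL(ref || selected), where `selected` is the eps-smoothed
    empirical distribution given by the state counts (assumed measure)."""
    total = sum(counts.values())
    kl = 0.0
    for state, p in ref.items():
        if p <= 0.0:
            continue
        q = (counts.get(state, 0) + eps) / (total + eps * len(ref))
        kl += p * math.log(p / q)
    return kl


def greedy_select(utterances, ref, budget, min_conf=0.9):
    """Greedy selection (hypothetical procedure, not the paper's).

    utterances: iterable of (utt_id, confidence, state_symbols)
    ref:        dict mapping state symbol -> reference probability
    budget:     maximum number of utterances to select
    """
    # Keep only high-confidence utterances, as in standard
    # semi-supervised training.
    pool = {u: Counter(states)
            for u, conf, states in utterances if conf >= min_conf}
    selected, counts = [], Counter()
    while pool and len(selected) < budget:
        # Pick the utterance whose addition brings the selected state
        # distribution closest to the reference distribution.
        best = min(pool, key=lambda u: kl_divergence(ref, counts + pool[u]))
        counts += pool.pop(best)
        selected.append(best)
    return selected, counts


# Toy data: two state symbols with a uniform reference distribution.
ref = {"a": 0.5, "b": 0.5}
utterances = [
    ("u1", 0.95, ["a"] * 4),
    ("u2", 0.95, ["b"] * 4),
    ("u3", 0.95, ["a"] * 4),
    ("u4", 0.50, ["b"] * 2),  # below the confidence threshold
]
selected, counts = greedy_select(utterances, ref, budget=2)
```

On this toy input, the low-confidence utterance is filtered out first, and the greedy step then pairs an "a"-heavy utterance with the "b"-heavy one rather than taking two "a"-heavy utterances. Random selection among the high-confidence utterances would have no such preference. The quadratic cost of this greedy loop would not scale to a massive collection, so a practical system would need a cheaper approximation.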
