Publication Data
Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Abstract: One of the difficult problems of acoustic modeling for
Automatic Speech Recognition (ASR) is how to adequately model the wide variety of
acoustic conditions which may be present in the data. The problem is especially acute
for tasks such as Google Search by Voice, where the amount of speech available per
transaction is small and adaptation techniques start showing their limitations. However,
because training data from a very large user population is available, it is possible to
identify and jointly model subsets of the data with similar acoustic qualities. We
describe a technique which allows us to perform this modeling at scale on large amounts
of data by learning a tree-structured partition of the acoustic space, and we
demonstrate that we can significantly improve recognition accuracy in various
conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully
unsupervised, this technique scales easily to increasing numbers of conditions.
