Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Abstract
One of the difficult problems of acoustic modeling for Automatic
Speech Recognition (ASR) is how to adequately model
the wide variety of acoustic conditions which may be present
in the data. The problem is especially acute for tasks such as
Google Search by Voice, where the amount of speech available
per transaction is small, and adaptation techniques start showing
their limitations. However, because training data from a very large user
population is available, it is possible to identify and
jointly model subsets of the data with similar acoustic qualities.
We describe a technique which allows us to perform this
modeling at scale on large amounts of data by learning a tree-structured
partition of the acoustic space, and we demonstrate
that we can significantly improve recognition accuracy in various
conditions through unsupervised Maximum Mutual Information
(MMI) training. Being fully unsupervised, this technique
scales easily to increasing numbers of conditions.