A big data approach to acoustic model training corpus selection
Venue
Conference of the International Speech Communication Association (Interspeech) (2014)
Publication Year
2014
Authors
Olga Kapralova, John Alex, Eugene Weinstein, Pedro Moreno, Olivier Siohan
Abstract
Deep neural networks (DNNs) have recently become the state-of-the-art technology in
speech recognition systems. In this paper we propose a new approach to constructing
large, high-quality unsupervised training sets for DNN models for large-vocabulary
speech recognition. The core of our technique consists of two steps. We first
redecode speech logged by our production recognizer with a very accurate (and hence
too slow for real-time use) set of speech models to improve the quality of the
ground-truth transcripts used for training alignments. Then, using confidence
scores, transcript length, and transcript-flattening heuristics designed to cull
salient utterances from three decades of speech per language, we carefully select
training sets of up to 15K hours of speech used to train acoustic models without
any reliance on manual transcription. We show that this approach yields models with
approximately 18K context-dependent states that achieve a 10% relative improvement
in large-vocabulary dictation and voice-search systems for Brazilian Portuguese,
French, Italian, and Russian.
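
To make the selection step concrete, below is a minimal Python sketch of how confidence, length, and flattening filters of this general kind might be combined. The threshold values, field names, and the per-transcript cap used here to approximate "transcript flattening" are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def select_training_utterances(utterances,
                               min_confidence=0.9,   # assumed threshold, not from the paper
                               min_words=1,
                               max_words=20,
                               max_per_transcript=100):
    """Filter redecoded utterances with confidence, length, and
    transcript-flattening heuristics (all parameter values illustrative)."""
    per_transcript = defaultdict(int)
    selected = []
    for utt in utterances:  # each utt assumed to be {"transcript": str, "confidence": float}
        # Keep only utterances the accurate second-pass recognizer is confident about.
        if utt["confidence"] < min_confidence:
            continue
        # Drop transcripts that are too short or too long to be useful for alignment.
        n_words = len(utt["transcript"].split())
        if not (min_words <= n_words <= max_words):
            continue
        # "Flatten" the transcript distribution: cap how many copies of any one
        # transcript (e.g. a very frequent voice-search query) enter the set.
        if per_transcript[utt["transcript"]] >= max_per_transcript:
            continue
        per_transcript[utt["transcript"]] += 1
        selected.append(utt)
    return selected
```

One plausible motivation for a cap of this sort is that production logs are dominated by a small number of very frequent queries; limiting repeats keeps the selected hours acoustically and lexically diverse.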
