Discriminative Keyword Spotting
Venue
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, Wiley (2009)
Publication Year
2009
Authors
David Grangier, Joseph Keshet, Samy Bengio
Abstract
This chapter introduces a discriminative method for detecting and spotting keywords
in spoken utterances. Given a word represented as a sequence of phonemes and a
spoken utterance, the keyword spotter predicts the best time span of the phoneme
sequence in the spoken utterance along with a confidence. If the prediction
confidence is above a certain level, the keyword is declared to be spoken in the
utterance within the predicted time span; otherwise the keyword is declared not
spoken. The problem of keyword spotting training is formulated as a discriminative
task in which the model parameters are chosen so that an utterance in which the keyword is
spoken receives higher confidence than any utterance in which the
keyword is not spoken. It is shown theoretically and empirically that the proposed
training method results in a high area under the receiver operating characteristic (ROC)
curve, the most common measure for evaluating keyword spotters. We present an
iterative algorithm to train the keyword spotter efficiently. The proposed approach
contrasts with standard spotting strategies based on HMMs, for which the training
procedure does not optimize a criterion directly related to the spotting performance.
Several experiments performed on the TIMIT and WSJ corpora show the advantage of our
approach over HMM-based alternatives.
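
The link between the pairwise training criterion and the area under the ROC curve can be illustrated with a small sketch. It is not the chapter's implementation: the function names `detect` and `pairwise_auc` and the confidence values are hypothetical, chosen only for illustration. The sketch relies on the standard fact that the area under the ROC curve equals the fraction of (keyword present, keyword absent) utterance pairs in which the utterance containing the keyword receives the higher confidence.

```python
from itertools import product


def detect(confidence, threshold):
    """Declare the keyword spoken when the confidence is above the threshold."""
    return confidence > threshold


def pairwise_auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs.

    pos_scores: confidences for utterances that contain the keyword.
    neg_scores: confidences for utterances that do not contain it.
    Ties are counted as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical confidences, for illustration only.
pos = [2.1, 0.7, 1.5]   # keyword spoken
neg = [0.3, 0.9, -0.4]  # keyword not spoken
auc = pairwise_auc(pos, neg)
print(f"AUC = {auc:.3f}")
print("detected at threshold 1.0:", [detect(c, 1.0) for c in pos])
```

A model that ranks every utterance containing the keyword above every utterance that does not would reach an area of 1.0 under this measure, which is why a training criterion defined over such pairs relates directly to spotting performance.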
