Publication Data
Discriminative Keyword Spotting
Abstract: This chapter introduces a discriminative method for
detecting and spotting keywords in spoken utterances. Given a word represented as a
sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time
span of the phoneme sequence in the spoken utterance along with a confidence. If the
prediction confidence is above a certain level, the keyword is declared to be spoken in
the utterance within the predicted time span; otherwise, the keyword is declared as not
spoken. Training the keyword spotter is formulated as a discriminative task in which
the model parameters are chosen so that an utterance in which the keyword is spoken
receives a higher confidence than any utterance in which the keyword is
not spoken. It is shown theoretically and empirically that the proposed training method
results in a high area under the receiver operating characteristic (ROC) curve, the most
common measure for evaluating keyword spotters. We present an iterative algorithm to train
the keyword spotter efficiently. The proposed approach contrasts with standard spotting
strategies based on HMMs, for which the training procedure does not optimize an objective
directly related to the spotting performance. Several experiments performed on the TIMIT
and WSJ corpora show the advantage of our approach over HMM-based alternatives.
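
The link between the pairwise training criterion and the area under the ROC curve can be illustrated with a minimal sketch. It assumes the spotter has already produced one confidence per utterance, and it uses the standard Wilcoxon-Mann-Whitney estimate of the area: the fraction of (keyword-present, keyword-absent) utterance pairs that are ranked correctly. The function names `auc` and `is_keyword_spoken` and the example scores are hypothetical and are not taken from the chapter.

```python
from typing import Sequence


def auc(pos_conf: Sequence[float], neg_conf: Sequence[float]) -> float:
    """Wilcoxon-Mann-Whitney estimate of the area under the ROC curve.

    pos_conf: confidences for utterances in which the keyword is spoken.
    neg_conf: confidences for utterances in which it is not spoken.
    (Both inputs are assumed to come from an already trained spotter.)
    """
    if not pos_conf or not neg_conf:
        raise ValueError("need at least one positive and one negative utterance")
    correct = 0.0
    for p in pos_conf:
        for n in neg_conf:
            if p > n:
                correct += 1.0   # pair ranked correctly
            elif p == n:
                correct += 0.5   # ties count as half, by convention
    return correct / (len(pos_conf) * len(neg_conf))


def is_keyword_spoken(confidence: float, threshold: float) -> bool:
    """Decision rule from the abstract: declare the keyword if confidence exceeds the level."""
    return confidence > threshold


if __name__ == "__main__":
    # Made-up confidences, for illustration only.
    print(auc([0.9, 0.8], [0.1, 0.85]))
    print(is_keyword_spoken(0.7, threshold=0.5))
```

Under this view, every correctly ordered pair raises the estimate, which is why a training criterion that asks keyword-present utterances to outscore keyword-absent ones is directly tied to the evaluation measure.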
