
Kevin Kilgour

Authored Publications
    Training Keyword Spotters with Limited and Synthesized Speech Data
    James Lin
    International Conference on Acoustics, Speech, and Signal Processing, IEEE, Barcelona, Spain (2020)
    Abstract: With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of high-quality training data. In this paper, we explore the effectiveness of synthesized speech data in training small spoken term detection models. Instead of training such models directly on audio or on low-level features such as MFCCs, we use a small speech embedding model trained to extract features useful for keyword spotting models. Using this embedding, we show that a model for detecting 10 keywords, when trained on only synthetic speech, is equivalent to a model trained on over 50 real examples, and to a model trained on 4000 real examples if we do not use the speech embeddings.
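The recipe the abstract describes, training only a small classifier on top of a frozen speech embedding instead of on raw audio or MFCCs, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random embed_clip placeholder, the 96-dimensional feature size, the 20 synthetic examples per keyword, and the logistic-regression head are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
NUM_KEYWORDS, EMB_DIM, EXAMPLES_PER_KEYWORD = 10, 96, 20

def embed_clip(audio):
    # Stand-in for the frozen, pretrained speech embedding model that
    # maps a short audio clip to a fixed-length feature vector.
    return rng.normal(size=EMB_DIM)

# "Synthesized" training set: several TTS renderings per keyword,
# represented here by random vectors so the sketch stays self-contained.
X = np.stack([embed_clip(None)
              for _ in range(NUM_KEYWORDS * EXAMPLES_PER_KEYWORD)])
y = np.repeat(np.arange(NUM_KEYWORDS), EXAMPLES_PER_KEYWORD)

# Only this small head is trained; the embedding model stays frozen.
head = LogisticRegression(max_iter=1000).fit(X, y)
print(head.predict(X[:3]))  # predicted keyword ids for the first three clips
```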
    Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
    Interspeech, Graz, Austria (2019)
    Abstract: We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric, used to evaluate generative image models, to the audio domain. FAD is validated on a wide variety of artificial distortions and compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance, and magnitude L2 distance. We show that FAD, with a correlation coefficient of 0.52, correlates more closely with human perception than SDR, cosine distance, or magnitude L2 distance, whose correlation coefficients are 0.39, -0.15, and -0.01 respectively.
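Like FID, FAD fits a multivariate Gaussian to embeddings of the reference audio and to embeddings of the audio under evaluation, then computes the Fréchet distance between the two Gaussians: ||mu_r - mu_t||^2 + tr(Sigma_r + Sigma_t - 2 (Sigma_r Sigma_t)^(1/2)). A minimal sketch of that computation, assuming embeddings have already been extracted with a pretrained audio model; the random toy data and helper names are illustrative, not the paper's released tooling.

```python
import numpy as np
from scipy import linalg

def gaussian_stats(emb: np.ndarray):
    """Mean and covariance of a (num_frames, dim) embedding matrix."""
    return emb.mean(axis=0), np.cov(emb, rowvar=False)

def frechet_audio_distance(ref_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_r, sigma_r = gaussian_stats(ref_emb)
    mu_t, sigma_t = gaussian_stats(test_emb)
    diff = mu_r - mu_t
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which is discarded.
    covmean = linalg.sqrtm(sigma_r @ sigma_t)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_t - 2.0 * covmean))

# Toy usage: random embeddings standing in for model activations.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 128))
test = rng.normal(loc=0.5, size=(1000, 128))
print(frechet_audio_distance(ref, test))
```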
    Now Playing: Continuous Low-Power Music Recognition
    NIPS 2017 Workshop: Machine Learning on the Phone and other Consumer Devices (2017)
    Abstract: Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone's DSP (digital signal processor) chip and wakes the main processor only when it is confident that music is present. Once woken, the recognizer on the main processor is provided with an 8-second buffer of audio, which is fingerprinted and compared against the on-device fingerprint database of over 70,000 songs.
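The two-stage design, a tiny always-on detector that gates a more expensive on-device fingerprint lookup, might look roughly like the following. Every concrete choice here (the detector score, 256-bit fingerprints, a Hamming-distance nearest-neighbor lookup, the wake threshold) is an illustrative assumption rather than the system's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
FP_BITS = 256

# Hypothetical on-device database: one binary fingerprint per song
# (the real system stores fingerprints for over 70,000 songs).
db = {f"song_{i}": rng.integers(0, 2, size=FP_BITS, dtype=np.uint8)
      for i in range(1000)}

def music_detector(frame) -> float:
    # Stand-in for the small detector running continuously on the DSP.
    return float(rng.random())

def fingerprint(audio_8s) -> np.ndarray:
    # Stand-in for the fingerprinter run on the main processor.
    return rng.integers(0, 2, size=FP_BITS, dtype=np.uint8)

def maybe_recognize(frame, audio_8s, wake_threshold=0.9):
    # Stage 1: the cheap detector keeps the main processor asleep
    # unless it is confident that music is present.
    if music_detector(frame) < wake_threshold:
        return None
    # Stage 2: fingerprint the 8 s audio buffer and return the song
    # whose stored fingerprint is closest in Hamming distance.
    fp = fingerprint(audio_8s)
    return min(db, key=lambda song: int(np.count_nonzero(db[song] ^ fp)))

print(maybe_recognize(frame=None, audio_8s=None, wake_threshold=0.0))
```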