Sound Retrieval and Ranking Using Sparse Auditory Representations
Venue
Neural Computation, vol. 22 (2010), pp. 2390-2416
Publication Year
2010
Authors
Richard F. Lyon, Martin Rehn, Samy Bengio, Thomas C. Walters, Gal Chechik
Abstract
To create systems that understand the sounds that humans are exposed to in everyday
life, we need to represent sounds with features that can discriminate among many
different sound classes. Here, we use a sound-ranking framework to quantitatively
evaluate such representations in a large-scale task. We have adapted a
machine-vision method, the "passive-aggressive model for image retrieval"
(PAMIR), which efficiently learns a linear mapping from a very large sparse feature
space to a large query-term space. Using this approach, we compare different
auditory front ends and different ways of extracting sparse features from
high-dimensional auditory images. We test auditory models that use an adaptive
pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature
extraction from stabilized auditory images via multiple vector quantizers. In
addition to auditory image models, we also compare a family of more conventional
Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show
a significant advantage for the auditory models over vector-quantized MFCCs.
Ranking thousands of sound files with a query vocabulary of thousands of words, the
best precision at top-1 was 73% and the average precision was 35%, reflecting an
18% improvement over the best competing MFCC front end.
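
As a rough illustration of the retrieval model described in the abstract, the sketch below shows a PAMIR-style passive-aggressive update that learns a linear mapping W and scores a (query, sound) pair as q·W·x. The function name, the dense NumPy arrays, and the aggressiveness parameter C are illustrative assumptions, not the paper's implementation, which operates on very large sparse feature vectors.

```python
import numpy as np

def pamir_update(W, q, x_pos, x_neg, C=1.0):
    """One passive-aggressive ranking update for a linear scoring model.

    W      : (n_query_terms, n_features) linear map from features to query space
    q      : (n_query_terms,) bag-of-words query vector
    x_pos  : (n_features,) features of a sound relevant to the query
    x_neg  : (n_features,) features of an irrelevant sound
    C      : assumed aggressiveness bound on the step size
    """
    # Hinge loss on the ranking margin: the relevant sound should score
    # at least 1 higher than the irrelevant one under the current W.
    loss = max(0.0, 1.0 - q @ W @ x_pos + q @ W @ x_neg)
    if loss > 0.0:
        # The update direction is the rank-1 matrix q (x_pos - x_neg)^T.
        V = np.outer(q, x_pos - x_neg)
        tau = min(C, loss / (np.linalg.norm(V) ** 2))
        W = W + tau * V
    return W
```

In a training loop one would repeatedly sample a query together with one relevant and one irrelevant sound file and apply this update until ranking performance on held-out data stops improving.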
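The sparse-code step can be pictured in a similarly hedged way: local patches cut from the stabilized auditory image are each assigned to the nearest codeword in a per-region vector-quantizer codebook, and the codeword counts are concatenated into one mostly-zero histogram per sound file. The function name, the data layout, and the plain nearest-neighbour loop below are assumptions for illustration; the paper's box-cutting scheme and codebook training are omitted.

```python
import numpy as np

def sparse_code(patches, codebooks):
    """Turn auditory-image patches into a single sparse count vector.

    patches   : list of lists; patches[b] holds the (d,) patch vectors cut from
                region b of each frame of the stabilized auditory image
    codebooks : list of (k_b, d) arrays, one vector-quantizer codebook per region
    Returns the concatenated histogram of codeword counts (mostly zeros).
    """
    parts = []
    for patch_seq, codebook in zip(patches, codebooks):
        counts = np.zeros(codebook.shape[0])
        for p in patch_seq:
            # Winner-take-all: nearest codeword under Euclidean distance.
            idx = np.argmin(np.sum((codebook - p) ** 2, axis=1))
            counts[idx] += 1
        parts.append(counts)
    return np.concatenate(parts)
```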
