Publication Data
Sound Ranking Using Auditory Sparse-Code Representations
Abstract: The task of ranking sounds from text queries is a good test
application for machine-hearing techniques, and particularly for comparison and
evaluation of alternative sound representations in a large-scale setting. We have
adapted a machine-vision system, "passive-aggressive model for image retrieval"
(PAMIR), which efficiently learns, using a ranking-based cost function, a linear
mapping from a very large sparse feature space to a large query-term space. Using this
system allows us to focus on comparison of different auditory front ends and different
ways of extracting sparse features from high-dimensional auditory images. In addition
to two main auditory-image models, we also include and compare a family of more
conventional MFCC front ends. The experimental results show a significant advantage for
the auditory models over vector-quantized MFCCs. The two auditory models tested use the
adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature
extraction from stabilized auditory images via multiple vector quantizers. The models
differ in their implementation of the strobed temporal integration used to generate the
stabilized image. Measured by precision-at-top-k ranking metrics, the best
results are about 70% top-1 precision and 35% average precision on a test corpus of
thousands of sound files with a query vocabulary of hundreds of words.
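
To make the ranking-based training and the precision-at-top-k measure concrete, here is a minimal sketch in Python. It assumes the standard passive-aggressive ranking update on a triplet (query, relevant sound, irrelevant sound) with a margin of 1. The abstract does not give these details, so the function names, the aggressiveness parameter `C`, and the dense NumPy arrays are all illustrative assumptions rather than the authors' implementation. A real system would use sparse vectors.

```python
import numpy as np

def pa_rank_update(W, q, p_pos, p_neg, C=1.0):
    """One passive-aggressive ranking step (illustrative sketch).

    W      : (n_terms, n_features) linear map; score(q, p) = q @ W @ p
    q      : (n_terms,) query-term vector
    p_pos  : (n_features,) features of a sound relevant to q
    p_neg  : (n_features,) features of a sound not relevant to q
    C      : assumed cap on the step size (aggressiveness)
    """
    diff = p_pos - p_neg
    # Hinge loss: the relevant sound should outscore the irrelevant one by 1.
    loss = max(0.0, 1.0 - float(q @ W @ diff))
    if loss == 0.0:
        return W, loss  # passive: the ranking constraint already holds
    # Squared Frobenius norm of the rank-one gradient outer(q, diff).
    v_norm_sq = float(q @ q) * float(diff @ diff)
    if v_norm_sq == 0.0:
        return W, loss  # nothing to update on
    tau = min(C, loss / v_norm_sq)
    return W + tau * np.outer(q, diff), loss

def precision_at_k(ranked_relevance, k):
    """Fraction of relevant items among the top k of a ranked 0/1 list."""
    top = np.asarray(ranked_relevance[:k], dtype=float)
    return float(top.sum() / k)

# Toy usage with made-up vectors: 2 query terms, 3 sparse-code features.
W0 = np.zeros((2, 3))
q = np.array([1.0, 0.0])
p_pos = np.array([1.0, 0.0, 0.0])
p_neg = np.array([0.0, 1.0, 0.0])
W1, loss0 = pa_rank_update(W0, q, p_pos, p_neg)
```

With `k = 1`, `precision_at_k` corresponds to the top-1 precision quoted above. The update changes `W` only when a triplet violates the margin, and each step is a rank-one outer product, which stays cheap when `q` and the feature vectors are sparse.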
