Large-Scale Audio Event Discovery in One Million YouTube Videos
Venue
Proceedings of ICASSP (2017) (to appear)
Publication Year
2017
Authors
Aren Jansen, Jort F. Gemmeke, Daniel P. W. Ellis, Xiaofeng Liu, Wade Lawrence, Dylan Freedman
Abstract
Internet videos provide a virtually boundless source of audio with a conspicuous
lack of localized annotations, presenting an ideal setting for unsupervised
methods. With this motivation, we perform an unprecedented exploration into the
large-scale discovery of recurring audio events in a diverse corpus of one million
YouTube videos (45K hours of audio). Our approach is to apply a streaming,
nonparametric clustering algorithm to both spectral features and out-of-domain
neural audio embeddings, and use a small portion of manually annotated audio events
to quantitatively estimate the intrinsic clustering performance. In addition to
providing a useful mechanism for unsupervised active learning, we demonstrate the
effectiveness of the discovered audio event clusters in two downstream
applications. The first is weakly-supervised learning, where we exploit the
association of video-level metadata and cluster occurrences to temporally localize
audio events. The second is informative activity detection, an unsupervised method
for semantic saliency based on the corpus statistics of the discovered event
clusters.
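
As an illustration of what a streaming, nonparametric clustering step can look like, the sketch below uses a simple threshold-based online scheme: each incoming feature vector joins the nearest existing cluster if it is close enough, and otherwise starts a new cluster, so the number of clusters is not fixed in advance. The abstract does not specify which algorithm the authors use, so this is only an assumed stand-in; the class name `StreamingClusterer`, the cosine distance, the `threshold` value, and the toy two-dimensional vectors are all hypothetical choices made for the example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class StreamingClusterer:
    """Hypothetical threshold-based online clusterer (not the paper's method)."""

    def __init__(self, threshold):
        self.threshold = threshold  # max distance allowed to join a cluster
        self.centroids = []         # running mean vector of each cluster
        self.counts = []            # number of points assigned to each cluster

    def add(self, x):
        """Assign one vector to a cluster in a single pass; return its label."""
        best, best_dist = None, float("inf")
        for i, centroid in enumerate(self.centroids):
            dist = cosine_distance(x, centroid)
            if dist < best_dist:
                best, best_dist = i, dist

        # No cluster is close enough: open a new one seeded with this point.
        if best is None or best_dist > self.threshold:
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Otherwise update the winning centroid as an incremental mean.
        self.counts[best] += 1
        n = self.counts[best]
        centroid = self.centroids[best]
        for j, xj in enumerate(x):
            centroid[j] += (xj - centroid[j]) / n
        return best


# Toy "frames" standing in for spectral features or audio embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
clusterer = StreamingClusterer(threshold=0.2)
labels = [clusterer.add(f) for f in frames]
print(labels)  # [0, 0, 1, 1, 0]
```

Because each vector is seen once and only the centroids and counts are kept, memory grows with the number of clusters rather than the number of frames, which is the property that makes this kind of scheme plausible for a corpus of this size.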