Audio Set: An ontology and human-labeled dataset for audio events
Venue
Proc. IEEE ICASSP 2017, New Orleans, LA (to appear)
Publication Year
2017
Authors
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, Marvin Ritter
Abstract
Audio event recognition, the human-like ability to identify and relate sounds from
audio, is a nascent problem in machine perception. Comparable problems such as
object detection in images have reaped enormous benefits from comprehensive
datasets -- principally ImageNet. This paper describes the creation of Audio Set, a
large-scale dataset of manually annotated audio events that endeavors to bridge the
gap in data availability between image and audio research. Using a carefully
structured hierarchical ontology of 635 audio classes guided by the literature and
manual curation, we collect data from human labelers to probe the presence of
specific audio classes in 10-second segments of YouTube videos. Segments are
proposed for labeling using searches based on metadata, context (e.g., links), and
content analysis. The result is a dataset of unprecedented breadth and size that
will, we hope, substantially stimulate the development of high-performance audio
event recognizers.
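
As a rough illustration of the two structures the abstract mentions, a hierarchical ontology of audio classes and per-segment presence labels, the following minimal Python sketch may help. It is not the dataset's actual format or tooling: the names `AudioClass`, `SegmentLabel`, and `path_to`, the toy class names, and the placeholder video ID are all assumptions made for this example. Only the 10-second segment length and the idea of a human judging whether a class is present come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AudioClass:
    """One node in a hypothetical hierarchical ontology of audio classes."""
    name: str
    children: List["AudioClass"] = field(default_factory=list)


def path_to(root: AudioClass, target: str,
            trail: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Return the class names from the root down to `target`, or None if absent."""
    trail = trail + (root.name,)
    if root.name == target:
        return list(trail)
    for child in root.children:
        found = path_to(child, target, trail)
        if found is not None:
            return found
    return None


@dataclass
class SegmentLabel:
    """A labeler's judgment on whether one class is present in one segment."""
    video_id: str      # placeholder identifier, hypothetical
    start_s: float     # segment start time in seconds
    class_name: str    # class the labeler was asked about
    present: bool      # the labeler's answer

    # Segment length stated in the abstract (10 seconds).
    # Unannotated, so it is a class attribute rather than a dataclass field.
    SEGMENT_LENGTH_S = 10.0

    @property
    def end_s(self) -> float:
        """End time of the segment, derived from the fixed segment length."""
        return self.start_s + self.SEGMENT_LENGTH_S


# Toy ontology with illustrative class names (not the real 635-class ontology).
ontology = AudioClass("Sounds", [
    AudioClass("Animal", [AudioClass("Dog", [AudioClass("Bark")])]),
    AudioClass("Music", [AudioClass("Guitar")]),
])

# One hypothetical label: "Bark" judged present in a segment starting at 30 s.
label = SegmentLabel("VIDEO_ID_PLACEHOLDER", 30.0, "Bark", True)
ancestry = path_to(ontology, label.class_name)
```

Keeping the ontology as a tree lets a label on a specific class (such as the toy `Bark` node here) be related to its broader parent classes by walking the path from the root.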