A large-scale dataset of manually annotated audio events
A sound vocabulary and dataset

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
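Since the ontology is distributed as a hierarchical graph of event categories, a small sketch may help illustrate how such a structure can be traversed. This is a simplified model only: the node schema (id, name, child_ids) is patterned after the released ontology file, but the category IDs and names below are made up for illustration.

```python
# Minimal sketch of a hierarchical sound ontology as a graph of
# category nodes. The schema (id, name, child_ids) mirrors the
# general shape of the released ontology; the IDs and names here
# are hypothetical, not real AudioSet entries.
ontology = [
    {"id": "/m/human", "name": "Human sounds", "child_ids": ["/m/speech", "/m/laughter"]},
    {"id": "/m/speech", "name": "Speech", "child_ids": []},
    {"id": "/m/laughter", "name": "Laughter", "child_ids": []},
    {"id": "/m/music", "name": "Music", "child_ids": ["/m/guitar"]},
    {"id": "/m/guitar", "name": "Guitar", "child_ids": []},
]

by_id = {node["id"]: node for node in ontology}

def descendants(node_id):
    """Collect the names of all classes below a category (depth-first)."""
    names = []
    for child_id in by_id[node_id]["child_ids"]:
        names.append(by_id[child_id]["name"])
        names.extend(descendants(child_id))
    return names

print(descendants("/m/human"))  # ['Speech', 'Laughter']
```

Because leaf classes inherit from their ancestors, walking the graph this way is how one would map a top-level category to the full set of sound classes it covers.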

By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.

2.1 million annotated videos
5.8 thousand hours of audio
527 classes of annotated sounds

Large-scale data collection

To collect our data, we worked with human annotators who verified the presence of sounds they heard within YouTube segments. To nominate segments for annotation, we relied on YouTube metadata and content-based search.

The resulting dataset provides broad coverage across the audio event classes in our ontology.

Explore further

The ontology and dataset construction are described in more detail in our ICASSP 2017 paper. You can contribute to the ontology at our GitHub repository. The dataset and machine-extracted features are available at the download page.
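The released dataset takes the form of lists of labeled 10-second segments, so a short parsing sketch may be useful. This assumes a CSV-style row layout of video ID, start time, end time, and a quoted comma-separated list of label IDs; the sample rows, video IDs, and label IDs below are invented for illustration, so check the actual download files for the exact format.

```python
import csv
import io

# Illustrative sample rows in the style of a segment list: video ID,
# start_seconds, end_seconds, then a quoted comma-separated list of
# label IDs. All IDs below are made up for this sketch.
sample = '''--abc123xyz, 30.000, 40.000, "/m/000001,/m/000002"
--def456uvw, 0.000, 10.000, "/m/000003"
'''

segments = []
for ytid, start, end, labels in csv.reader(io.StringIO(sample), skipinitialspace=True):
    segments.append({
        "ytid": ytid,
        "start": float(start),
        "end": float(end),
        "labels": labels.split(","),  # one segment can carry several labels
    })

print(segments[0]["labels"])  # ['/m/000001', '/m/000002']
```

Note that each segment can carry multiple labels, so audio event detection on this data is naturally a multi-label task.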


This dataset is brought to you from the Sound Understanding group in the Machine Perception Research organization at Google. More about us.

If you want to stay up-to-date about this dataset, please subscribe to our Google Group: audioset-users. The group should be used for discussions about the dataset and the starter code.