A large-scale dataset of manually annotated audio events
Explore the data
Hi-hat (3,900 annotations in dataset)
Vehicle (128,051 annotations in dataset)
Gunshot, gunfire (4,221 annotations in dataset)
Laughter (5,696 annotations in dataset)
Car (41,554 annotations in dataset)
Applause (2,247 annotations in dataset)
Drum (20,246 annotations in dataset)
Bass drum (9,292 annotations in dataset)
Stream (2,847 annotations in dataset)
Brass instrument (7,513 annotations in dataset)
Water tap, faucet (2,442 annotations in dataset)
Motorboat, speedboat (8,078 annotations in dataset)
Speech (1,010,480 annotations in dataset)
Cymbal (4,688 annotations in dataset)
Baby cry, infant cry (2,390 annotations in dataset)
Music (1,011,305 annotations in dataset)
Siren (8,498 annotations in dataset)
Violin, fiddle (28,125 annotations in dataset)
Choir (6,709 annotations in dataset)
Acoustic guitar (14,568 annotations in dataset)
Engine (16,245 annotations in dataset)
Fireworks (3,051 annotations in dataset)
Rapping (4,496 annotations in dataset)

A sound vocabulary and dataset

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
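
The ontology itself is distributed as a JSON file in our GitHub repository. Below is a minimal sketch of walking that graph; it assumes each entry carries "id", "name", and "child_ids" fields, as in the released ontology.json, and uses the "Music" class (MID /m/04rlf) as an example.

    # A minimal sketch of traversing the AudioSet ontology graph, assuming
    # the ontology.json file from the GitHub repository, where each entry
    # has (among other fields) "id", "name", and "child_ids".
    import json

    with open("ontology.json") as f:
        nodes = {node["id"]: node for node in json.load(f)}

    def descendants(node_id, seen=None):
        """Yield IDs of all classes below node_id (nodes may have multiple parents)."""
        seen = set() if seen is None else seen
        for child_id in nodes[node_id].get("child_ids", []):
            if child_id not in seen:
                seen.add(child_id)
                yield child_id
                yield from descendants(child_id, seen)

    # Example: list the classes under "Music" (MID /m/04rlf).
    music = [nodes[i]["name"] for i in descendants("/m/04rlf")]
    print(len(music), music[:5])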

By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.

Explore the ontology

2.1 million annotated videos
5.8 thousand hours of audio
527 classes of annotated sounds

Large-scale data collection

To collect all our data, we worked with human annotators who verified the presence of sounds they heard within YouTube segments. To nominate segments for annotation, we relied on YouTube metadata and content-based search.

Our resulting dataset has excellent coverage of the audio event classes in our ontology.
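
Per-class annotation counts like those shown above can be tallied directly from the released segment lists. Here is a minimal sketch, assuming the published CSV format (comment lines starting with "#", then rows of YTID, start_seconds, end_seconds, and a quoted, comma-separated list of positive label MIDs); the filename is that of one released split and should be treated as an assumption about your local copy.

    # A minimal sketch of counting annotations per class from a segment CSV
    # (here the balanced training split; the local path is an assumption).
    # Format assumed: "# ..." comment lines, then rows of
    #   YTID, start_seconds, end_seconds, "mid1,mid2,..."
    import csv
    from collections import Counter

    counts = Counter()
    with open("balanced_train_segments.csv") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):
                continue  # skip header/comment lines
            for mid in row[3].split(","):
                counts[mid] += 1

    # MIDs map to human-readable names via the ontology; Speech (/m/09x0r)
    # and Music (/m/04rlf) are by far the most frequent classes.
    print(counts.most_common(5))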

Explore the dataset

Explore further

The ontology and dataset construction are described in more detail in our ICASSP 2017 paper. You can contribute to the ontology at our GitHub repository. The dataset and machine-extracted features are available at the download page.
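
For the machine-extracted features, the sketch below shows one way to read a record. It assumes the released tf.SequenceExample layout (context features "video_id" and "labels", plus an "audio_embedding" feature list holding one 128-byte quantized embedding per second of audio); the tfrecord path is a placeholder.

    # A minimal sketch of reading frame-level features, assuming the released
    # tf.SequenceExample format. The file path is a placeholder.
    import tensorflow as tf

    def parse_example(serialized):
        context, sequence = tf.io.parse_single_sequence_example(
            serialized,
            context_features={
                "video_id": tf.io.FixedLenFeature([], tf.string),
                "labels": tf.io.VarLenFeature(tf.int64),
            },
            sequence_features={
                "audio_embedding": tf.io.FixedLenSequenceFeature([], tf.string),
            },
        )
        # Each step is a 128-byte string of quantized uint8 values.
        embeddings = tf.io.decode_raw(sequence["audio_embedding"], tf.uint8)
        return context["video_id"], tf.sparse.to_dense(context["labels"]), embeddings

    dataset = tf.data.TFRecordDataset("features/bal_train.tfrecord").map(parse_example)
    for video_id, labels, emb in dataset.take(1):
        print(video_id.numpy(), labels.numpy(), emb.shape)  # emb: (seconds, 128)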

People

This dataset is brought to you by the Sound Understanding group in the Machine Perception Research organization at Google. More about us.

If you want to stay up to date about this dataset, please subscribe to our Google Group, audioset-users. The group should be used for discussions about the dataset and the starter code.