AudioSet

Temporally-Strong Labels Download (May 2021)

For the original release of 10-sec-resolution labels, see the Download page.

In 2020, we performed additional annotation on some of the AudioSet clips, this time using a procedure that instructed the annotators to mark every distinct sound event they perceived (complete annotation), and to indicate the start and end times of each event by dragging out a region on a spectrogram (“strong” labeling). We collected these new annotations for all 16,996 clips of the evaluation set (excluding the ones that have become unavailable since the original release), and 103,463 clips from the training set (about 5%, chosen at random). The annotators had to choose from a hierarchical, closed vocabulary of around 600 sound event labels; they were instructed to use the most specific label available. There are 456 distinct labels in the data. Unlike the original AudioSet, we did not record any detail within musical segments; such sounds were simply labeled as music.

We are releasing these data to accompany our ICASSP 2021 paper. These data are being made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. The primary strong-label files are in a tab-separated-value format based on truth files from DCASE 2019 Task 4, specifically:

<clip_id>\t<start_time_seconds>\t<end_time_seconds>\t<MID>

Here clip_id has the form ytid_starttimems, where ytid is the parent YouTube id and starttimems is the offset, in milliseconds, at which the annotated 10 sec clip begins within that video’s soundtrack. MID is the machine ID of the sound event class, and \t indicates a tab character. For example:

s9d-2nhuJCQ_30000   2.627      7.237           /m/053hz1

This indicates that for the excerpt spanning 30-40 sec within the YouTube video s9d-2nhuJCQ, the annotators identified an instance of Cheering (MID /m/053hz1) occurring from t=2.627 sec to t=7.237 sec (4.61 sec duration) within the excerpt. Because an excerpt generally contains multiple sound events, each file has multiple lines with the same clip id.
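As a minimal sketch of reading one such line, the Python below splits the four tab-separated fields and separates the clip id into its YouTube id and start offset. The names StrongLabel and parse_strong_label are hypothetical, chosen for illustration; they are not part of the release.

```python
from typing import NamedTuple


class StrongLabel(NamedTuple):
    """One sound event from a strong-label file (illustrative type)."""
    ytid: str               # parent YouTube id
    clip_start_sec: float   # where the 10 sec excerpt begins in the video
    event_start_sec: float  # event onset, relative to the excerpt
    event_end_sec: float    # event offset, relative to the excerpt
    mid: str                # machine ID of the sound event class


def parse_strong_label(line: str) -> StrongLabel:
    clip_id, start, end, mid = line.rstrip("\n").split("\t")
    # YouTube ids can themselves contain underscores, so split on the
    # LAST underscore to isolate the millisecond offset.
    ytid, start_ms = clip_id.rsplit("_", 1)
    return StrongLabel(ytid, int(start_ms) / 1000.0,
                       float(start), float(end), mid)


label = parse_strong_label("s9d-2nhuJCQ_30000\t2.627\t7.237\t/m/053hz1")
duration = label.event_end_sec - label.event_start_sec
print(label.ytid, label.clip_start_sec, round(duration, 3))
```

Splitting on the last underscore rather than the first is the one design choice that matters here, since an id such as a_b-cDEFghij_30000 would otherwise be cut in the wrong place.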

The file audioset_train_strong.tsv describes 934,821 sound events across the 103,463 excerpts from the training set. There are 447 MIDs present, of which 376 are shared with the 527 labels in the original AudioSet data (see discussion in this GitHub issue). Note that 103,463 excerpts is significantly more than the 66,924 promised in the ICASSP paper; the difference reflects additional annotations collected after the paper was written.

The file audioset_eval_strong.tsv describes 139,538 segments across the 16,996 excerpts from the evaluation set. There are 416 unique MIDs, 9 of which are not present in the train labels. 381 of the MIDs are shared with the original AudioSet data.
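Counts such as "MIDs present in eval but not in train" reduce to a set difference over the fourth column. The sketch below shows the idea on toy in-memory rows that are made up for illustration (they are not file contents); mids_in is a hypothetical helper. Whether the released files begin with a header row is not stated here, so the helper simply ignores any row whose fourth field does not look like a MID.

```python
import csv
import io


def mids_in(tsv_file):
    """Collect the distinct MIDs (4th column) of a strong-label TSV."""
    mids = set()
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumption: MIDs start with "/" (e.g. /m/053hz1); this also
        # skips a header row if one is present.
        if len(row) >= 4 and row[3].startswith("/"):
            mids.add(row[3])
    return mids


# Toy stand-ins for the train and eval files (illustrative rows only).
toy_train = io.StringIO("clipA_0\t0.000\t1.500\t/m/053hz1\n")
toy_eval = io.StringIO(
    "clipB_30000\t2.000\t3.000\t/m/053hz1\n"
    "clipB_30000\t4.000\t9.000\t/m/04rlf\n"
)

train_mids = mids_in(toy_train)
eval_mids = mids_in(toy_eval)
only_in_eval = eval_mids - train_mids
print(sorted(only_in_eval))
```

For the real files, the same helper could be applied to handles opened with open(path, newline="") for audioset_train_strong.tsv and audioset_eval_strong.tsv.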

We include a second version of labels for a subset of the evaluation set, corresponding to the evaluation numbers reported in our paper. This version includes both positive (present) and negative (confirmed not present) labels. The not-present labels were chosen to prefer confusable clips (e.g., clips that scored higher for that class under a classifier, despite being confirmed as negative). Both positive and negative clips were, as far as possible, balanced at around 150 excerpts per class, where a single excerpt can contribute up to 10 individual 960 ms segments.

The labels were additionally projected onto a 960 ms grid, so that every label covers exactly 960 ms. (Evaluation is then performed based on scores averaged over that 960 ms support.) Finally, we add "complementary negatives": 960 ms frames that have zero intersection with a positive label in the clip are asserted as negatives, to better reward classification with accurate temporal resolution.
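The projection and the complementary negatives can be sketched as follows for a single class in a single clip. This is an illustration under stated assumptions, not the procedure actually used: it assumes a frame counts as positive if it has any nonzero overlap with an event (the exact overlap rule is not given here), and it works in integer milliseconds to avoid floating-point edge cases. The function name frame_labels is hypothetical.

```python
FRAME_MS = 960     # grid resolution
CLIP_MS = 10_000   # excerpt length


def frame_labels(events_ms, frame_ms=FRAME_MS, clip_ms=CLIP_MS):
    """Split a clip's frames into positives and complementary negatives.

    events_ms: list of (start_ms, end_ms) for ONE class in ONE clip.
    Returns (positives, negatives), each a list of (start_ms, end_ms).
    """
    n_frames = clip_ms // frame_ms  # 10 whole frames fit in 10 sec
    positives, negatives = [], []
    for i in range(n_frames):
        f_start, f_end = i * frame_ms, (i + 1) * frame_ms
        # Assumption: any nonzero overlap marks the frame as positive.
        overlaps = any(s < f_end and e > f_start for s, e in events_ms)
        (positives if overlaps else negatives).append((f_start, f_end))
    return positives, negatives


# An event from 2.627 sec to 7.237 sec within the excerpt.
pos, neg = frame_labels([(2627, 7237)])
print(len(pos), "positive frames;", len(neg), "complementary negatives")
```

With this rule, an event from 2.627 sec to 7.237 sec touches the six frames from 1.920 sec through 7.680 sec, and the remaining four whole frames become complementary negatives.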

Because this set includes both positive and negative labels, we include a 5th field in the tab-separated values, i.e.:

<clip_id>\t<start_time_seconds>\t<end_time_seconds>\t<MID>\t<PRESENT|NOT_PRESENT>

For example:

YxlGt805lTA_30000       0.960   1.920   /m/04rlf        PRESENT
YxlGt805lTA_30000       0.960   1.920   /m/07rgkc5      NOT_PRESENT

This indicates that "Music" (/m/04rlf) was marked PRESENT during the second 960 ms frame in the 10 sec clip starting at 30 sec in YouTube video YxlGt805lTA, but "Static" (/m/07rgkc5) was marked NOT_PRESENT.
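A short sketch of reading the five-field format, using the two example lines above. The helper parse_framed_label is a hypothetical name; it converts the start time to a zero-based frame index on the 960 ms grid and the fifth field to a boolean.

```python
FRAME_SEC = 0.960


def parse_framed_label(line):
    """Return (clip_id, frame_index, mid, is_present) for one line."""
    clip_id, start, end, mid, presence = line.rstrip("\n").split("\t")
    if presence not in ("PRESENT", "NOT_PRESENT"):
        raise ValueError(f"unexpected presence field: {presence!r}")
    # Start times sit on the 960 ms grid; round() absorbs float error.
    frame_index = round(float(start) / FRAME_SEC)
    return clip_id, frame_index, mid, presence == "PRESENT"


lines = [
    "YxlGt805lTA_30000\t0.960\t1.920\t/m/04rlf\tPRESENT",
    "YxlGt805lTA_30000\t0.960\t1.920\t/m/07rgkc5\tNOT_PRESENT",
]
parsed = [parse_framed_label(l) for l in lines]
for row in parsed:
    print(row)
```

A frame index of 1 corresponds to the second frame of the clip, since indices start at zero.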

The file audioset_eval_strong_framed_posneg.tsv includes 300,307 positive labels and 658,221 negative labels within 14,203 excerpts from the evaluation set. Both the positive and negative labels cover 356 MIDs: the classes, also included in the original AudioSet release, that had sufficient representation in the original strong labels to allow meaningful evaluation.

Finally, the file mid_to_display_name.tsv maps the 456 MIDs mentioned in the label files to their human-readable names, e.g.

/m/01280g    Wild animals
/m/012f08    Motor vehicle (road)
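Loading this mapping into a dictionary makes the label files easier to read. The sketch below assumes the two columns are separated by a single tab, as the .tsv extension suggests, and uses an in-memory sample built from the two rows above; load_display_names is a hypothetical helper name.

```python
import csv
import io


def load_display_names(tsv_file):
    """Map each MID to its human-readable display name."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# In-memory stand-in for mid_to_display_name.tsv (tab-separated).
sample = io.StringIO(
    "/m/01280g\tWild animals\n"
    "/m/012f08\tMotor vehicle (road)\n"
)
names = load_display_names(sample)
print(names["/m/012f08"])
```

For the real file, pass a handle opened with open("mid_to_display_name.tsv", newline="") in place of the sample.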