AVA is a project that provides audiovisual annotations of video to improve our understanding of human activity. The annotated videos are all 15-minute movie clips, each exhaustively labeled by human annotators; the use of movie clips is expected to yield a richer variety of recording conditions and representations of human activity.
We provide two sets of annotations.
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute movie clips. Actions are localized in space and time, resulting in 1.58M action labels, with multiple labels per human occurring frequently. A detailed description of our contributions with this dataset can be found in our accompanying CVPR '18 paper.
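To make the structure of these spatiotemporal labels concrete, here is a minimal sketch of parsing AVA-style annotation rows. It assumes the common CSV layout of video id, keyframe timestamp, a normalized bounding box, an action id, and a person id; the field names and sample values below are illustrative, not taken from the released files.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class ActionLabel:
    video_id: str
    timestamp: float                          # seconds from the start of the clip
    box: tuple                                # (x1, y1, x2, y2), normalized to [0, 1]
    action_id: int                            # index into the 80-action vocabulary
    person_id: int                            # links boxes of one person across time

def parse_action_csv(text):
    """Parse assumed AVA-style rows: video_id,ts,x1,y1,x2,y2,action_id,person_id."""
    labels = []
    for vid, ts, x1, y1, x2, y2, action, person in csv.reader(io.StringIO(text)):
        labels.append(ActionLabel(
            video_id=vid,
            timestamp=float(ts),
            box=(float(x1), float(y1), float(x2), float(y2)),
            action_id=int(action),
            person_id=int(person),
        ))
    return labels

# The same person often carries several labels at one timestamp,
# e.g. one row for a pose action and one for an interaction action.
sample = (
    "clip_a,902,0.077,0.151,0.283,0.811,11,1\n"
    "clip_a,902,0.077,0.151,0.283,0.811,64,1\n"
)
labels = parse_action_csv(sample)
```

Grouping rows by `(video_id, timestamp, person_id)` then recovers the multi-label set attached to each detected person.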
The AVA-Speech dataset densely annotates speech activity for the movie clips in the AVA v1.0 dataset. It explicitly labels 3 background noise conditions (Clean Speech, Speech with background Music, and Speech with background Noise), resulting in ~40K labeled segments spanning 40 hours of data. The labels are available for download here. A detailed description of the dataset can be found in our accompanying Interspeech '18 paper.
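Because AVA-Speech labels are dense segments rather than keyframe boxes, a natural first analysis is to total the labeled duration per condition. The sketch below assumes a simple segment layout of video id, start time, end time, and condition label; the label strings and field order are assumptions, so check them against the released files.

```python
import csv
import io
from collections import defaultdict

def seconds_per_condition(text):
    """Sum labeled segment durations per condition from assumed
    video_id,start_seconds,end_seconds,label rows."""
    totals = defaultdict(float)
    for vid, start, end, label in csv.reader(io.StringIO(text)):
        totals[label] += float(end) - float(start)
    return dict(totals)

# Illustrative segments only; real clips are densely covered end to end.
sample = (
    "clip_a,0.0,9.0,CLEAN_SPEECH\n"
    "clip_a,9.0,15.0,SPEECH_WITH_MUSIC\n"
    "clip_b,0.0,6.0,NO_SPEECH\n"
)
totals = seconds_per_condition(sample)
```

Dividing the totals by 3600 gives per-condition hours, the unit used in the dataset summary above.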
For announcements and details on the 2019 challenge, please join the Google Group: ava-dataset-users.