
AVA Actions Download (v2.1)

Download all AVA Actions data: ava_v2.1.zip

The AVA v2.1 dataset contains 430 videos split into 235 for training, 64 for validation, and 131 for test. Each video has 15 minutes annotated at 1-second intervals. The annotations are provided as CSV files:

For Task B - Spatio-temporal Action Localization (AVA) at the ActivityNet 2018 Challenge, we're releasing the video ids for a set of 131 labeled test videos. The challenge will only evaluate performance on a subset of 60 classes. For details on how to submit your predictions on these videos please see the ActivityNet 2018 Challenge page.

Generally raters provided annotations at timestamps 902:1798 inclusive, in seconds, at 1-second intervals. Performance is measured on all of these "included" timestamps, including those for which raters determined no action was present. For certain videos, some timestamps were excluded from annotation because raters marked the corresponding video clips as inappropriate. Performance is not measured on the "excluded" timestamps. The lists of included and excluded timestamps are:
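The set of evaluated timestamps for a video can be enumerated directly from the rule above. A minimal sketch (the exclusion list here is a hypothetical input, not a released file):

```python
# Enumerate the timestamps on which AVA performance is measured for one video.
# Annotations cover seconds 902..1798 inclusive, at 1-second intervals.

def included_timestamps(excluded=()):
    """Return the evaluated timestamps, minus any rater-excluded ones."""
    excluded = set(excluded)
    return [t for t in range(902, 1799) if t not in excluded]

timestamps = included_timestamps()
print(len(timestamps))  # 897 timestamps for a fully annotated video
```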

CSV Format

Each row contains an annotation for one person performing an action in an interval, where that annotation is associated with the middle frame. Each person and each action label is described in a separate row.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, person_id

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the YouTube video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • action_id: identifier of an action class, see ava_action_list_v2.1.pbtxt
  • person_id: a unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video.
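The row format above can be parsed with Python's standard csv module; a short sketch follows (the sample row and frame size are illustrative, not taken from the released files):

```python
import csv
import io

# Parse AVA Actions CSV rows and convert a normalized person box to pixels.
# The sample row below is illustrative, not copied from the released CSVs.
sample = "abc123XYZ00,902,0.077,0.151,0.283,0.811,80,1\n"

def parse_row(row):
    """Unpack one CSV row into the fields documented above."""
    video_id, ts, x1, y1, x2, y2, action_id, person_id = row
    return {
        "video_id": video_id,
        "middle_frame_timestamp": int(ts),
        "person_box": (float(x1), float(y1), float(x2), float(y2)),
        "action_id": int(action_id),
        "person_id": int(person_id),
    }

def box_to_pixels(box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

rows = [parse_row(r) for r in csv.reader(io.StringIO(sample))]
print(box_to_pixels(rows[0]["person_box"], 640, 480))
```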

AVA v2.1 differs from v2.0 only by the removal of a small number of movies that were determined to be duplicates. The class list and label map remain unchanged from v1.0.

The dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Evaluation Code

Code for running the Frame-mAP evaluation can be found in the ActivityNet GitHub.

Pre-trained Model

A pre-trained baseline model is also available. It was created using the Tensorflow Object Detection API.

The baseline model is an image-based Faster R-CNN detector with a ResNet-101 feature extractor. Compared with other commonly used object detectors, the action classification loss function has been changed to a per-class sigmoid loss to handle boxes with multiple labels. The model was trained on the training split of AVA v2.1 for 1.5M iterations, and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
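The switch from softmax to per-class sigmoid can be illustrated with a small NumPy sketch (this is only an illustration of the loss, not the baseline's actual implementation):

```python
import numpy as np

# Per-class sigmoid loss: each action class is an independent binary decision,
# so one person box can carry several positive labels at once. A softmax
# cross-entropy would instead force the classes to compete for probability mass.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_class_sigmoid_loss(logits, labels):
    """Sum of binary cross-entropies over classes; labels is a multi-hot vector."""
    p = sigmoid(logits)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

logits = np.array([2.0, -1.0, 3.0])
labels = np.array([1.0, 0.0, 1.0])   # a box labeled with two actions at once
print(per_class_sigmoid_loss(logits, labels))
```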

The model checkpoint can be obtained here. The predictions of this model on the AVA v2.1 validation split, in the CSV format described above, can be downloaded here: ava_baseline_detections_val_v2.1.zip.

Download (v1.0)

Files from the previous version of AVA can be downloaded here:

AVA Speech Download (v1.0)

Download AVA Speech Labels: ava_speech_labels_v1.csv

Each row contains an annotation for an interval of a video clip. For each video in the dataset, 15 minutes (from 15:00 to 30:00) are densely labeled for speech activity using one of four labels: {NO_SPEECH, CLEAN_SPEECH, SPEECH_WITH_MUSIC, SPEECH_WITH_NOISE}. Each new label appears in a separate row.

The format of a row is the following: video_id, label_start_timestamp_seconds, label_end_timestamp_seconds, label

  • video_id: YouTube identifier
  • label_start_timestamp_seconds: in seconds from the start of the video.
  • label_end_timestamp_seconds: in seconds from the start of the video.
  • label: label for the interval specified by the start and end timestamps. This will be one of {NO_SPEECH, CLEAN_SPEECH, SPEECH_WITH_MUSIC, SPEECH_WITH_NOISE}.
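These rows can be aggregated with a few lines of standard-library Python; a sketch, using illustrative sample rows rather than data from the released file:

```python
import csv
import io

# Parse AVA Speech label rows and total the duration carried by each label.
# The sample rows below are illustrative, not taken from the released CSV.
sample = (
    "abc123XYZ00,900.0,905.5,NO_SPEECH\n"
    "abc123XYZ00,905.5,912.0,CLEAN_SPEECH\n"
)

durations = {}
for video_id, start, end, label in csv.reader(io.StringIO(sample)):
    durations[label] = durations.get(label, 0.0) + float(end) - float(start)

print(durations)  # seconds of each label in the sample
```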

The AVA speech labels v1.0 release contains dense labels for 160 videos (from the original list of 188 videos in AVA v1.0) that are still available on YouTube.
