AVA

Download (v2.1)

Download all data: ava_v2.1.zip

The AVA v2.1 dataset contains 430 videos split into 235 for training, 64 for validation, and 131 for test. Each video has 15 minutes annotated at 1-second intervals. The annotations are provided as CSV files.

For Task B - Spatio-temporal Action Localization (AVA) at the ActivityNet 2018 Challenge, we're releasing the video IDs for a set of 131 labeled test videos. The challenge will only evaluate performance on a subset of 60 classes. For details on how to submit your predictions on these videos, please see the ActivityNet 2018 Challenge page.

Generally, raters provided annotations at timestamps 902 through 1798 inclusive, in seconds, at 1-second intervals. Performance is measured on all of these "included" timestamps, including those for which raters determined no action was present. For certain videos, some timestamps were excluded from annotation because raters marked the corresponding video clips as inappropriate. Performance is not measured on the "excluded" timestamps. The lists of included and excluded timestamps are provided as separate files.
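
For example, the set of timestamps scored for a single video can be assembled by enumerating the annotated range and dropping the excluded seconds. A minimal sketch in Python, assuming the exclusion list is a two-column CSV of video_id and timestamp (the actual file layout may differ):

    import csv

    def load_exclusions(path):
        """Read (video_id, timestamp) pairs that raters excluded.
        The two-column row format is an assumption for illustration."""
        with open(path, newline="") as f:
            return {(vid, int(ts)) for vid, ts in csv.reader(f)}

    def evaluated_timestamps(video_id, excluded):
        """Timestamps 902..1798 inclusive, minus any excluded seconds."""
        return [t for t in range(902, 1799)
                if (video_id, t) not in excluded]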

CSV Format

Each row contains an annotation for one person performing an action in an interval, where that annotation is associated with the middle frame. Different persons and multiple action labels are described in separate rows.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the YouTube video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to the frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • action_id: identifier of an action class, see ava_action_list_v2.1.pbtxt
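
Since person_box expands to four comma-separated values, each row has seven fields. A minimal parsing sketch in Python, assuming that field order; frame_w and frame_h are whatever your decoded frames use:

    import csv
    from collections import namedtuple

    # One record per CSV row; person_box is expanded to x1, y1, x2, y2.
    Annotation = namedtuple(
        "Annotation", "video_id timestamp x1 y1 x2 y2 action_id")

    def read_annotations(path):
        """Yield one Annotation per row of an AVA-style CSV file."""
        with open(path, newline="") as f:
            for vid, ts, x1, y1, x2, y2, aid in csv.reader(f):
                yield Annotation(vid, int(ts),
                                 float(x1), float(y1),
                                 float(x2), float(y2), int(aid))

    def to_pixels(ann, frame_w, frame_h):
        """Convert the normalized person_box to pixel coordinates
        for a frame of the given width and height."""
        return (ann.x1 * frame_w, ann.y1 * frame_h,
                ann.x2 * frame_w, ann.y2 * frame_h)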

AVA v2.1 differs from v2.0 only by the removal of a small number of movies that were determined to be duplicates. The class list and label map remain unchanged from v1.0.

The dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Pre-trained Model

A pre-trained baseline model is also available. It was created using the TensorFlow Object Detection API.

The baseline model is an image-based Faster R-CNN detector with a ResNet-101 feature extractor. Compared with other commonly used object detectors, the action classification loss function has been changed to a per-class sigmoid loss to handle boxes with multiple labels. The model was trained on the training split of AVA v2.1 for 1.5M iterations, and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
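
The switch from softmax to per-class sigmoid means each action class is scored as an independent binary decision, so a single person box can carry several positive labels at once (e.g. standing while talking). A minimal TensorFlow sketch of such a loss; tensor shapes and names are illustrative, not the model's actual code:

    import tensorflow as tf

    def multilabel_action_loss(logits, labels):
        """logits: [num_boxes, num_classes] raw classifier outputs.
        labels: [num_boxes, num_classes] multi-hot targets; a box may
        have several 1s, one per action it is labeled with."""
        # An independent sigmoid per class, instead of a softmax over
        # classes, lets every positive label on a box contribute.
        per_class = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=labels, logits=logits)
        return tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))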

The model checkpoint can be obtained here. The predictions of this model on the AVA v2.1 validation split, in the CSV format described above, can be downloaded here: ava_baseline_detections_val_v2.1.zip.
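
Mean AP here is frame-level average precision, where a predicted box counts as correct only if it overlaps a ground-truth box of the same class with sufficient intersection-over-union (the standard AVA protocol evaluates at an IoU threshold of 0.5). The sketch below shows the IoU computation on the normalized boxes from the CSV format above; it is a simplification, not the official evaluation code:

    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes in the
        normalized coordinates used by the CSV format above."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0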

Download (v1.0)

Files from the previous version of AVA (v1.0) can be downloaded here.