The AVA v2.1 dataset contains 430 videos split into 235 for training, 64 for validation, and 131 for test. Each video has 15 minutes annotated at 1-second intervals. The annotations are provided as CSV files:
For Task B - Spatio-temporal Action Localization (AVA) at the ActivityNet 2018 Challenge, we're releasing the video IDs for a set of 131 labeled test videos. The challenge will only evaluate performance on a subset of 60 classes. For details on how to submit your predictions on these videos, please see the ActivityNet 2018 Challenge page.
Raters generally provided annotations at timestamps 902 through 1798 (inclusive), measured in seconds, at 1-second intervals. Performance is measured on all of these "included" timestamps, including those for which raters determined no action was present. For certain videos, some timestamps were excluded from annotation because raters marked the corresponding video clips as inappropriate. Performance is not measured on the "excluded" timestamps. The lists of included and excluded timestamps are:
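The inclusion rule above can be sketched as follows. This is a minimal illustration, not the official evaluation code; the helper name and the example excluded set are hypothetical, and in practice the per-video excluded timestamps come from the released lists.

```python
# Sketch: which timestamps of a video count toward evaluation.
# The excluded set here is a made-up example; real values come from the
# released excluded-timestamp list.

def included_timestamps(excluded):
    """All annotated timestamps (902..1798 inclusive) minus the excluded set."""
    return [t for t in range(902, 1799) if t not in excluded]

# A hypothetical video with two timestamps marked as excluded.
excluded = {905, 1200}
ts = included_timestamps(excluded)
assert len(ts) == (1798 - 902 + 1) - 2  # 897 annotated seconds, 2 excluded
```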
Each row contains an annotation for one person performing an action in an interval, where that annotation is associated with the middle frame. Different persons and multiple action labels are described in separate rows.
The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, person_id
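As a sketch of reading one such row: the snippet below assumes that person_box is serialized as four normalized corner coordinates (x1, y1, x2, y2), giving eight comma-separated fields per row. The sample video ID, timestamp, and coordinate values are placeholders, not real dataset entries.

```python
import csv
from io import StringIO

# Hypothetical sample row; assumes person_box expands to four normalized
# coordinates, so the fields are:
# video_id, middle_frame_timestamp, x1, y1, x2, y2, action_id, person_id
sample = "vid001,0902,0.077,0.151,0.283,0.811,80,1"

row = next(csv.reader(StringIO(sample)))
record = {
    "video_id": row[0],
    "middle_frame_timestamp": int(row[1]),
    "person_box": tuple(map(float, row[2:6])),  # (x1, y1, x2, y2)
    "action_id": int(row[6]),
    "person_id": int(row[7]),
}
```

Since multiple action labels for the same person appear as separate rows, rows sharing (video_id, middle_frame_timestamp, person_id) describe one person with several labels.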
AVA v2.1 differs from v2.0 only by the removal of a small number of movies that were determined to be duplicates. The class list and label map remain unchanged from v1.0.
The dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Code for running the Frame-mAP evaluation can be found in the ActivityNet GitHub.
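To give a feel for what Frame-mAP measures, here is a toy sketch of the per-class average-precision computation at its core. It deliberately omits the box-matching step (deciding whether a detection overlaps a ground-truth box sufficiently to count as a hit) and is not a substitute for the official script in the ActivityNet GitHub.

```python
# Toy sketch of average precision for one class, given detections that have
# already been matched against ground truth (hit = 1 if correct, else 0).

def average_precision(scored_matches):
    """scored_matches: list of (confidence, hit) pairs with hit in {0, 1}."""
    scored_matches = sorted(scored_matches, key=lambda x: -x[0])
    total_pos = sum(hit for _, hit in scored_matches)
    tp = 0
    ap = 0.0
    for rank, (_, hit) in enumerate(scored_matches, start=1):
        if hit:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / total_pos if total_pos else 0.0

# Three detections, ranked by confidence; the middle one is a false positive.
print(average_precision([(0.9, 1), (0.8, 0), (0.7, 1)]))
```

Mean AP is then the average of this quantity over the evaluated classes.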
A pre-trained baseline model, built with the TensorFlow Object Detection API, is also available.
The baseline model is an image-based Faster R-CNN detector with a ResNet-101 feature extractor. Unlike commonly used object detectors, it replaces the standard softmax classification loss with a per-class sigmoid loss for action classification, so that a single box can carry multiple labels. The model was trained on the training split of AVA v2.1 for 1.5M iterations, and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
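The per-class sigmoid loss can be sketched as a sum of independent binary cross-entropies, one per class, against a multi-hot target. This is an illustrative stand-alone version, not the model's actual training code; the logits and labels below are made up.

```python
import math

# Sketch: per-class sigmoid (binary cross-entropy) loss over a multi-hot
# target, so one box can be positive for several action classes at once.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_sigmoid_loss(logits, labels):
    """Sum of independent binary cross-entropies, one term per class."""
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss

# Hypothetical box labeled with classes 0 and 2 simultaneously.
logits = [2.0, -1.0, 3.0]
labels = [1, 0, 1]
loss = multilabel_sigmoid_loss(logits, labels)
```

A softmax loss would force the class probabilities to sum to one, making simultaneous labels (e.g. "stand" and "talk to") compete; independent sigmoids avoid that.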
The model checkpoint can be obtained here. The predictions of this model on the AVA v2.1 validation split, in the CSV format described above, can be downloaded here: ava_baseline_detections_val_v2.1.zip.