
AVA-Kinetics Download (v1.0)

Download all AVA-Kinetics data: ava_kinetics_v1_0.tar.gz

The AVA-Kinetics dataset consists of the original 430 videos from AVA v2.2, together with 238k videos from the Kinetics-700 dataset. For Kinetics we provide one annotated frame per video clip. The annotations are provided as CSV files, as described in the included README.txt file.

All of the annotation is provided in the .tar.gz file. Although there are separate CSV files for AVA and for Kinetics, it is expected that users will want to train and test on the union of the two.
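
As a minimal sketch of building that union with pandas, assuming header-less CSVs in the AVA column layout described below; the file names are placeholders, so check README.txt in the archive for the actual names:

    import pandas as pd

    # Column names follow the AVA Actions CSV layout described below.
    COLUMNS = ["video_id", "middle_frame_timestamp",
               "x1", "y1", "x2", "y2", "action_id", "person_id"]

    # Placeholder file names -- consult README.txt inside ava_kinetics_v1_0.tar.gz.
    ava = pd.read_csv("ava_train_v2.2.csv", header=None, names=COLUMNS)
    kinetics = pd.read_csv("kinetics_train_v1.0.csv", header=None, names=COLUMNS)

    # Train on the union of the two annotation sets.
    train = pd.concat([ava, kinetics], ignore_index=True)
    print(len(train), "annotations")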

AVA Actions Download (v2.2)

Download all AVA Actions data: ava_v2.2.zip

The AVA v2.2 dataset contains 430 videos split into 235 for training, 64 for validation, and 131 for test. Each video has 15 minutes annotated at 1-second intervals. The annotations are provided as CSV files.

For Task B - Spatio-temporal Action Localization (AVA) at the ActivityNet 2019 Challenge, we're releasing the video ids for a set of 131 labeled test videos. The challenge will only evaluate performance on a subset of 60 classes. For details on how to submit your predictions on these videos please see the ActivityNet 2019 Challenge page.

Generally, raters provided annotations at timestamps 902 to 1798 inclusive, in seconds, at 1-second intervals. Performance is measured on all of these "included" timestamps, including those for which raters determined no action was present. For certain videos, some timestamps were excluded from annotation because raters marked the corresponding video clips as inappropriate. Performance is not measured on the "excluded" timestamps. The lists of included and excluded timestamps are provided as separate files.
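
As an illustration of how these lists are used, the sketch below derives the evaluated timestamps for a single video. The excluded-timestamps file name is an assumption (check the downloaded file list for the actual name), and the file is assumed to hold one video_id, timestamp pair per row:

    import csv

    # All annotatable timestamps: 902..1798 inclusive, at 1-second intervals.
    ALL_TIMESTAMPS = set(range(902, 1799))

    def load_excluded(path):
        """Read (video_id, timestamp) rows into a per-video set of excluded seconds."""
        excluded = {}
        with open(path, newline="") as f:
            for video_id, timestamp in csv.reader(f):
                excluded.setdefault(video_id, set()).add(int(timestamp))
        return excluded

    # Placeholder file name -- use the excluded-timestamps file from the download.
    excluded = load_excluded("ava_val_excluded_timestamps_v2.2.csv")

    def evaluated_timestamps(video_id):
        """Timestamps on which performance is measured for one video."""
        return sorted(ALL_TIMESTAMPS - excluded.get(video_id, set()))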

CSV Format

Each row contains an annotation for one person performing an action in an interval, where that annotation is associated with the middle frame. Different persons and multiple action labels are described in separate rows.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, person_id

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the YouTube video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • action_id: identifier of an action class, see ava_action_list_v2.2.pbtxt
  • person_id: a unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video.
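
As a minimal parsing sketch for the fields above (the CSV path and the frame dimensions are caller-supplied assumptions, since the frame size is not stored in the CSV):

    import csv

    def load_ava_actions(csv_path, frame_width, frame_height):
        """Parse AVA Actions rows and rescale normalized boxes to pixel coordinates."""
        rows = []
        with open(csv_path, newline="") as f:
            for video_id, ts, x1, y1, x2, y2, action_id, person_id in csv.reader(f):
                rows.append({
                    "video_id": video_id,
                    "middle_frame_timestamp": float(ts),
                    # Normalized [0, 1] coordinates scaled to pixels.
                    "box_pixels": (float(x1) * frame_width, float(y1) * frame_height,
                                   float(x2) * frame_width, float(y2) * frame_height),
                    "action_id": int(action_id),
                    "person_id": int(person_id),
                })
        return rows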

AVA v2.2 differs from v2.1 in two ways. First, another round of human rating was conducted to add missing labels, increasing the number of annotations by 2.5%. Second, box locations were corrected for a small number of videos with aspect ratios much larger than 16:9.

AVA v2.1 differs from v2.0 only by the removal of a small number of movies that were determined to be duplicates. The class list and label map remain unchanged from v1.0.

Evaluation Code

Code for running the Frame-mAP evaluation can be found in the ActivityNet GitHub.

Pre-trained Model

A pre-trained baseline model is also available. It was created using the TensorFlow Object Detection API.

The baseline model is an image-based Faster R-CNN detector with a ResNet-101 feature extractor. Compared with commonly used object detectors, the action classification loss is changed to a per-class sigmoid loss to handle boxes with multiple labels. The model was trained on the training split of AVA v2.1 for 1.5M iterations, and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
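
The per-class sigmoid loss can be sketched as follows; this is an illustration of the idea in plain TensorFlow, not the actual Object Detection API implementation. Each action class gets an independent binary cross-entropy term, so a single box can carry several positive labels:

    import tensorflow as tf

    def per_class_sigmoid_loss(logits, multi_hot_labels):
        """Independent binary cross-entropy per action class.

        logits: float tensor of shape [num_boxes, num_classes].
        multi_hot_labels: float tensor of the same shape; a box with several
        action labels simply has several 1s in its row.
        """
        per_class = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=multi_hot_labels, logits=logits)
        return tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))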

The model checkpoint can be obtained here. The predictions of this model on the AVA v2.1 validation split, in the CSV format described above, can be downloaded here: ava_baseline_detections_val_v2.1.zip.

Download (previous versions)

Files from the previous version of AVA can be downloaded here:

AVA Active Speaker Download (v1.0)

Download AVA Active Speaker Labels:

Each zipped file contains a set of CSV files, one for each video in the corresponding train or val partition. The partitions are the same as for AVA Actions.

For Task B, Challenge #2 (Active Speaker Detection) at the ActivityNet 2019 Challenge, we're releasing data for 131 videos with the labels anonymized. For details on how to submit your predictions on these videos, please see the information for Challenge #2 on the ActivityNet 2019 Challenge: Task B page.

Each row in the CSV files contains an annotation of speaking activity for a single face in a single frame. Different persons are described in separate rows. The format of a row is the following: video_id, frame_timestamp, entity_box, label, entity_id.

  • video_id: YouTube identifier
  • frame_timestamp: in seconds from the start of the video.
  • entity_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • label: label for the entity specified at that frame. This will be one of {SPEAKING_AND_AUDIBLE, SPEAKING_BUT_NOT_AUDIBLE, NOT_SPEAKING}.
  • entity_id: a unique string allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video.
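
A minimal sketch, reading one of the per-video CSV files (assumed to have no header row) and grouping rows by entity_id to recover each person's face track in time order:

    import csv
    from collections import defaultdict

    def load_speaker_tracks(csv_path):
        """Group active-speaker rows by entity_id, ordered by frame_timestamp."""
        tracks = defaultdict(list)
        with open(csv_path, newline="") as f:
            for video_id, ts, x1, y1, x2, y2, label, entity_id in csv.reader(f):
                tracks[entity_id].append({
                    "video_id": video_id,
                    "frame_timestamp": float(ts),
                    "box": (float(x1), float(y1), float(x2), float(y2)),
                    # One of SPEAKING_AND_AUDIBLE, SPEAKING_BUT_NOT_AUDIBLE, NOT_SPEAKING.
                    "label": label,
                })
        for track in tracks.values():
            track.sort(key=lambda r: r["frame_timestamp"])
        return tracks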

The AVA Active Speaker labels v1.0 release contains dense labels for 160 videos (from the original list of 188 videos in AVA v1.0) that are still available on YouTube.

AVA Speech Download (v1.0)

Download AVA Speech Labels: ava_speech_labels_v1.csv

Each row contains an annotation for an interval of a video clip. For each video in the dataset, a 15-minute segment (from 15 minutes 0 seconds to 30 minutes 0 seconds) is densely labeled for speech activity using one of four possible labels: {NO_SPEECH, CLEAN_SPEECH, SPEECH_WITH_MUSIC, SPEECH_WITH_NOISE}. Each new label appears in a separate row.

The format of a row is the following: video_id, label_start_timestamp_seconds, label_end_timestamp_seconds, label

  • video_id: YouTube identifier
  • label_start_timestamp_seconds: in seconds from the start of the video.
  • label_end_timestamp_seconds: in seconds from the start of the video.
  • label: label for the interval specified by the start and end timestamps. This will be one of {NO_SPEECH, CLEAN_SPEECH, SPEECH_WITH_MUSIC, SPEECH_WITH_NOISE}.
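
As a small example of working with these rows, the sketch below sums the labeled duration per label for each video (assuming the CSV has no header row):

    import csv
    from collections import defaultdict

    def speech_seconds_per_label(csv_path):
        """Sum labeled duration (in seconds) per speech label for each video."""
        totals = defaultdict(lambda: defaultdict(float))
        with open(csv_path, newline="") as f:
            for video_id, start, end, label in csv.reader(f):
                totals[video_id][label] += float(end) - float(start)
        return totals

    # Example: totals = speech_seconds_per_label("ava_speech_labels_v1.csv")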

The AVA speech labels v1.0 release contains dense labels for 160 videos (from the original list of 188 videos in AVA v1.0) that are still available on YouTube.

License

All datasets listed here are made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
