The AVA dataset contains 192 videos split into 154 training and 38 test videos. Each video
has 15 minutes annotated in 3-second intervals, resulting in 300 annotated segments per video. These
annotations are specified by two CSV files:
Each row contains an annotation for one person performing one action in an interval, and that annotation is associated with the interval's middle frame. Different persons, and multiple action labels for the same person, are described in separate rows.
The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id
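
As a rough illustration of the row format above, the following Python sketch parses annotation rows and collects the action labels that belong to the same person in the same interval. It rests on several assumptions that are not stated in this description: that person_box is stored as comma-separated numeric fields between the timestamp and the action id (for example x1, y1, x2, y2), that the timestamp is numeric, and that the files have no header row. The names Annotation, parse_rows and group_actions, and the sample rows, are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    video_id: str
    timestamp: float  # middle_frame_timestamp (assumed numeric)
    box: tuple        # person_box (assumed to be numeric fields, e.g. x1, y1, x2, y2)
    action_id: int


def parse_rows(lines):
    """Parse rows of: video_id, middle_frame_timestamp, person_box, action_id."""
    annotations = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        video_id = row[0].strip()
        timestamp = float(row[1])
        # Assumption: every field between the timestamp and the last
        # field belongs to person_box.
        box = tuple(float(v) for v in row[2:-1])
        annotations.append(Annotation(video_id, timestamp, box, int(row[-1])))
    return annotations


def group_actions(annotations):
    """Collect all action ids for each (video, timestamp, person box)."""
    grouped = defaultdict(list)
    for a in annotations:
        grouped[(a.video_id, a.timestamp, a.box)].append(a.action_id)
    return dict(grouped)


# Hypothetical sample rows, not taken from the dataset.
SAMPLE = """\
vid_a,902,0.1,0.2,0.5,0.9,12
vid_a,902,0.1,0.2,0.5,0.9,17
vid_a,902,0.6,0.1,0.9,0.8,12
vid_a,905,0.1,0.2,0.5,0.9,80
"""

annotations = parse_rows(io.StringIO(SAMPLE))
grouped = group_actions(annotations)
```

Because each person and each action label gets its own row, grouping by video, timestamp and box is one simple way to recover the full label set for a person in an interval. To read a real annotation file, pass an open file object to parse_rows in place of the io.StringIO wrapper.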