The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute movie clips, where actions are localized in space and time, resulting in 1.58M action labels; a single person frequently carries multiple labels at once. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations, with possibly multiple annotations per person; (3) exhaustive annotation of these atomic actions over the full 15-minute video clips; (4) the use of movies as a source of diverse action representations.
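To make characteristic (2) concrete, the sketch below groups per-person action labels from AVA-style CSV annotations. The column layout (video_id, middle-frame timestamp, normalized box coordinates, action_id, person_id) and the sample rows are assumptions for illustration, not an excerpt from the released files.

```python
import csv
import io

# Illustrative rows in the assumed AVA CSV layout:
# video_id, timestamp, x1, y1, x2, y2, action_id, person_id
# One row per (person, action) pair, so a person performing several
# atomic actions at the same timestamp appears on several rows.
SAMPLE = """\
-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1
-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,12,1
"""

def load_annotations(fileobj):
    """Group action labels by (video_id, timestamp, person_id)."""
    labels = {}
    for row in csv.reader(fileobj):
        video_id, ts, x1, y1, x2, y2, action_id, person_id = row
        key = (video_id, int(ts), int(person_id))
        box = tuple(float(v) for v in (x1, y1, x2, y2))
        labels.setdefault(key, (box, []))[1].append(int(action_id))
    return labels

annotations = load_annotations(io.StringIO(SAMPLE))
# The single annotated person here carries two action labels.
```

Grouping by (video_id, timestamp, person_id) rather than by row is what surfaces the multi-label nature of the annotations: the box is stored once, and the action list grows as rows for the same person accumulate.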
AVA v2.1 is now available for download and is described in detail in the accompanying arXiv paper. AVA v2.1 also serves as the basis of a challenge at the ActivityNet workshop at CVPR 2018; details on the specific task are available on the challenge page.