We describe the DeepMind Kinetics human action video dataset. The dataset contains
400 human action classes, with at least 400 video clips for each action. Each clip
lasts around 10 s and is taken from a different YouTube video. The actions are
human-focused and cover a broad range of classes including human-object interactions
such as playing instruments, as well as human-human interactions such as shaking
hands. We describe the statistics of the dataset and how it was collected, and
report baseline performance figures for neural network architectures trained and
tested for human action classification on this dataset.
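To make the dataset's structure concrete, the sketch below shows one way a clip annotation could be represented in code. This is a minimal illustrative schema, not the released annotation format: the record fields (a YouTube ID, start/end times delimiting the roughly 10-second segment, and an action label) and all values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KineticsClip:
    """One annotated clip: a ~10 s segment of a YouTube video (illustrative schema)."""
    youtube_id: str    # source video identifier; each clip comes from a different video
    time_start: float  # clip start within the source video, in seconds
    time_end: float    # clip end; time_end - time_start is roughly 10 s
    label: str         # one of the 400 human action classes

# Hypothetical example record; the ID and times are made up for illustration.
clip = KineticsClip(youtube_id="abc123XYZ_w", time_start=12.0, time_end=22.0,
                    label="shaking hands")
assert 0.0 < clip.time_end - clip.time_start <= 10.5
```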