YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research

Who are we?

We are part of the Video Understanding group within Google Research. We work on building computer vision and video understanding systems at large scales, making it easier to find and discover great video content on YouTube and the web, and helping personal video collections become useful, delightful, and entertaining. Our long-term technology mission is to achieve the ability to understand and describe video at the level of a human expert, purely from pixels and audio samples.

We created this dataset to advance computer vision and video understanding at large scale. The videos were sampled to preserve the diverse distribution of popular YouTube content, the annotation vocabulary was carefully constructed, and the features were designed so that a million-hour video dataset fits on a single commodity hard disk. This makes it possible to download the dataset to a local machine and train a full-scale model in less than a day on a single GPU!

We feel that by giving researchers access to such a large labeled video dataset with precomputed features, we can eliminate storage and computational barriers and help accelerate research on large-scale video understanding. We hope this dataset will spur exciting new advances in video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning, and domain adaptation.

Our paper includes details on how we collected the initial dataset, as well as experimental results for some baseline video modeling and domain transfer approaches. Further experiments are available in the 2017 workshop proceedings. If you have questions about the dataset, or would like to be notified of updates, please subscribe to the Google Group youtube8m-users.