YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and
associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed
state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This
makes it possible to get started on this dataset by training a baseline video model in less than a day
on a single machine! At the same time, the dataset's scale and diversity can enable deep exploration of
complex audio-visual models that can take weeks to train even in a distributed fashion.
Our goal is to accelerate research on large-scale video understanding, representation
learning, noisy data modeling, transfer learning, and domain adaptation approaches for video.
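As a rough illustration of how per-frame features relate to a single video-level descriptor, here is a minimal mean-pooling sketch. This is pure Python with made-up feature values; the real dataset ships PCA-compressed frame features in TFRecord files, and the function below is a hypothetical helper, not part of the released tooling.

```python
# Minimal sketch: aggregate per-frame feature vectors into one
# video-level descriptor by mean pooling. Feature values here are
# made up; the real dataset stores much higher-dimensional features.

def mean_pool(frame_features):
    """Average a list of equal-length frame feature vectors."""
    if not frame_features:
        raise ValueError("no frames to pool")
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for frame in frame_features:
        for i, value in enumerate(frame):
            pooled[i] += value
    return [v / len(frame_features) for v in pooled]

# Three hypothetical 4-dim frame features (real RGB features are larger).
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_feature = mean_pool(frames)
print(video_feature)  # → [2.0, 1.0, 2.0, 2.0]
```

Pooling like this is what makes a video-level baseline cheap to train: one fixed-size vector per video, regardless of video length.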
More details about the dataset and initial experiments can be found in our technical report.
Some statistics from the latest version of the dataset are included below: the total number of videos, hours of video, and the average number of labels per video.
The (multiple) labels per video are Knowledge Graph
entities, organized into 24 top-level verticals.
Each entity represents a semantic topic that is visually recognizable in video, and the video labels reflect the main topics of each video.
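Because each video carries multiple labels, training code typically turns the sparse label indices into a multi-hot target vector. A minimal sketch, with a made-up vocabulary size and label indices:

```python
def multi_hot(label_indices, vocab_size):
    """Convert sparse label indices into a multi-hot 0/1 target vector."""
    vec = [0] * vocab_size
    for idx in label_indices:
        vec[idx] = 1
    return vec

# Hypothetical video labeled with entities 1 and 3 from a 6-entity vocabulary.
print(multi_hot([1, 3], 6))  # → [0, 1, 0, 1, 0, 0]
```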
You can download a CSV file of our vocabulary.
The first field in the file is each label's index in the dataset files,
with the first label corresponding to index 0; additional columns describe each entity,
including its Knowledge Graph ID and human-readable name.
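A minimal sketch of loading such a vocabulary file into an index-to-name mapping. The header and rows below are made up for illustration (only the index column is documented above), so treat the column names as assumptions:

```python
import csv
import io

# Made-up two-column vocabulary snippet; the real file has more columns.
sample_csv = """Index,Name
0,Game
1,Vehicle
2,Concert
"""

def load_vocab(fileobj):
    """Map each label index to its human-readable entity name."""
    return {int(row["Index"]): row["Name"] for row in csv.DictReader(fileobj)}

vocab = load_vocab(io.StringIO(sample_csv))
print(vocab[1])  # → Vehicle
```

With this mapping in hand, predicted label indices from a model can be translated back into readable entity names.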
The entity frequencies, plotted on a log-log scale, show a Zipf-like distribution.
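One way to check for a Zipf-like shape without plotting is to fit a line to log(frequency) versus log(rank); a slope near -1 indicates a classic Zipf law. A sketch with synthetic counts (the helper and data are illustrative, not from the dataset):

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) vs. log(rank), ranks starting at 1."""
    counts = sorted(counts, reverse=True)
    xs = [math.log(r) for r in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic counts drawn exactly from f(r) = 1e6 / r; slope should be ~ -1.
counts = [1_000_000 / r for r in range(1, 101)]
print(round(loglog_slope(counts), 3))  # → -1.0
```

A heavy-tailed distribution like this also explains why rare entities have far fewer training videos than the head of the vocabulary.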
In addition, we show histograms of the number of entities and the number of training videos in each top-level vertical.
This dataset is brought to you by the Video Understanding group at Google Research.
If you want to stay up to date about this dataset, please subscribe to our Google Group: youtube8m-users.
The group should be used for discussions about the dataset and the starter code.