

We offer the YouTube8M dataset for download as TensorFlow Record files. We provide a downloader script that fetches the dataset in shards and stores them in the current directory (the output of pwd). If the connection drops, the script can be restarted, in which case it downloads only the shards that have not been downloaded yet. We also provide HTML index pages listing all shards, if you would like to download them manually.

There are two versions of the features: frame-level and video-level features.

In addition to the dataset, we will soon offer TensorFlow starter code for training a baseline model on this data.

The code and dataset are licensed by Google Inc. under the Apache 2.0 license.

Starter Code

Starter code will be available soon for download. Please subscribe to our mailing list by joining the Google Group: youtube8m-users.

Frame-level features dataset

Frame-level features are stored as tensorflow.SequenceExample protocol buffers. One frame per second was extracted from each video and fed into an Inception network pre-trained on ImageNet. Each frame feature vector has 1024 dimensions. The number of feature vectors in the sequence equals the length of the video in seconds (capped at the first 300 seconds). The total data size is less than 1.5 terabytes.
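As a minimal sketch of the sizing rule described above, the following Python snippet computes the expected shape of the frame-level feature sequence for a video of a given length. The function name and the rounding of fractional seconds are assumptions made for illustration; they are not part of the dataset tooling.

```python
MAX_FRAMES = 300      # cap stated above: only the first 300 seconds are used
FEATURE_DIM = 1024    # dimensions per frame feature vector, as stated above


def frame_feature_shape(duration_seconds):
    """Return (num_frames, feature_dim) for a video of the given length.

    Assumption: a fractional trailing second yields no extra frame,
    so the duration is rounded down before the cap is applied.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # One feature vector per second of video, up to the 300-second cap.
    num_frames = min(int(duration_seconds), MAX_FRAMES)
    return (num_frames, FEATURE_DIM)


print(frame_feature_shape(120))   # a two-minute video
print(frame_feature_shape(1800))  # a half-hour video, capped at 300 frames
```

Under these assumptions, any video longer than five minutes contributes exactly 300 feature vectors.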

To download the frame-level features, you have the following options:
  • Manually download all 4096 shards from the frame-level training and the frame-level validation partitions. You may also find it useful to download a handful of shards, start developing your code against them, and kick off the larger download in parallel.
  • Use our Python download script. This assumes that you have Python and curl installed.

    To download the frame-level dataset using the download script, navigate in your terminal to the directory where you would like to store the data. For example:
    mkdir -p ~/data/yt8m; cd ~/data/yt8m
    Then download the training and validation data. Note: The files are large and the download can take over a day, occupying ~1.5 TB of space. Download the entire dataset as follows:
    curl us.data.yt8m.org/0/train/download.py | python
    curl us.data.yt8m.org/0/validate/download.py | python
    #curl us.data.yt8m.org/0/test/download.py | python # test partition coming soon
    The above uses the us mirror. If you are located in Europe or Asia, please replace the domain prefix us with eu or asia, respectively.
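The mirror substitution described above can be sketched in Python as follows. This is only an illustration of how the command strings are formed from the mirror prefix and the partition name; the helper name download_command is hypothetical, and the snippet prints the commands rather than running them.

```python
MIRRORS = ("us", "eu", "asia")       # mirror prefixes named in the text above
PARTITIONS = ("train", "validate")   # the test partition is not yet available


def download_command(mirror, partition):
    """Build the curl command for one mirror and one partition.

    Hypothetical helper: it only formats the command shown above and
    does not perform any download itself.
    """
    if mirror not in MIRRORS:
        raise ValueError("unknown mirror: {}".format(mirror))
    if partition not in PARTITIONS:
        raise ValueError("unknown partition: {}".format(partition))
    return "curl {}.data.yt8m.org/0/{}/download.py | python".format(
        mirror, partition
    )


# Print the commands a user in Europe would run.
for partition in PARTITIONS:
    print(download_command("eu", partition))
```

Printing the commands first makes it easy to check the mirror and partition before starting a download that may run for more than a day.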

Video-level features dataset

Video-level features are stored as tensorflow.Example protocol buffers. Each video has a single feature vector of size 1024, which is the mean of the frame-level features for the video. As with the frame-level features, we offer two download options: