
Download

We offer the YouTube8M dataset for download as TensorFlow Record files. We provide a downloader script that fetches the dataset in shards and stores them in the current directory (the output of pwd). The script can be restarted if the connection drops, in which case it downloads only the shards that have not been downloaded yet. We also provide HTML index pages listing all shards, if you would like to download them manually. The features come in two versions: frame-level and video-level. The dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
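The restart behavior described above can be illustrated with a minimal sketch. It is not the downloader script's actual code: the function name `shards_still_needed` and the shard file names are hypothetical, and it assumes that any file already present in the directory is a complete shard (it does not detect partially written files).

```python
from pathlib import Path


def shards_still_needed(expected_names, download_dir):
    """Return the expected shard names that are not yet present in download_dir.

    Assumption: a file that exists is treated as fully downloaded.
    """
    present = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    # Preserve the order of the expected list so downloads resume predictably.
    return [name for name in expected_names if name not in present]
```

A caller would pass the list of shard names taken from an index page and the target directory, then fetch only the names returned.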

Starter Code

Starter code for the dataset can be found on our GitHub page. In addition to training code, you will also find Python scripts that evaluate standard metrics for comparing models.

Frame-level features dataset

Frame-level features are stored as tensorflow.SequenceExample protocol buffers. A tensorflow.SequenceExample proto is reproduced here in text format:
context: {
  feature: {
    key  : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172] # The meaning of the labels can be found here.
      }
    }
  }
}

feature_lists: {
  feature_list: {
    key  : "rgb"
    value: {
      feature: {
        bytes_list: {
          value: [1024 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [1024 8bit quantized features]
        }
      }
      ... # Repeated for every second of the video, up to 300
    }
  }
  feature_list: {
    key  : "audio"
    value: {
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      ... # Repeated for every second of the video, up to 300
    }
  }

}
The total size of the frame-level features is 1.71 terabytes. They are split into 4096 shards, which can be subsampled to reduce the dataset size.
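To get a feel for what subsampling saves, here is a rough sketch. The rule the download script uses to pick shards is not documented on this page, so the interleaved selection in `subsample_shards` (a hypothetical helper) is an assumption, as is the idea that shards are roughly equal in size.

```python
TOTAL_SHARDS = 4096          # from the text above
FRAME_LEVEL_TOTAL_TB = 1.71  # from the text above


def subsample_shards(num_shards, k, n):
    """Indices kept when taking the k-th (1-based) of n interleaved slices.

    Assumption: selection is interleaved; the real script may choose differently.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [i for i in range(num_shards) if i % n == k - 1]


kept = subsample_shards(TOTAL_SHARDS, 1, 100)
# Size estimate, assuming roughly equal shard sizes (decimal units: 1 TB = 1000 GB).
estimated_gb = FRAME_LEVEL_TOTAL_TB * 1000 * len(kept) / TOTAL_SHARDS
print(f"{len(kept)} shards kept, about {estimated_gb:.1f} GB")
```

Under these assumptions, a 1-in-100 subsample keeps 41 of the 4096 shards.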

To download the frame-level features, you have the following options:
  • Manually download all 4096 shards from the frame-level training, frame-level validation, and frame-level test partitions. You may also find it useful to download a handful of shards first (see details below) and start developing your code against them while the larger download runs in parallel.
  • Use our Python download script. This assumes that you have Python and curl installed.

    To download the frame-level dataset using the download script, open a terminal and change to the directory where you would like to store the data. For example:
    mkdir -p ~/data/yt8m; cd ~/data/yt8m
    Then download the training, validation, and test data. Note: Make sure you have 1.71 TB of free disk space to store the frame-level feature files. Download the entire dataset as follows:
    curl data.yt8m.org/download.py | partition=1/frame_level/train mirror=us python
    curl data.yt8m.org/download.py | partition=1/frame_level/validate mirror=us python
    curl data.yt8m.org/download.py | partition=1/frame_level/test mirror=us python
    The above uses the us mirror. If you are located in Europe or Asia, please swap the mirror flag us with eu or asia, respectively.

    To download 1/100th of the training data from the US mirror, use:
    curl data.yt8m.org/download.py | shard=1,100 partition=1/frame_level/train mirror=us python
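Before starting the full frame-level download, it can help to confirm that the target directory has enough free space. The following is a small sketch using only the Python standard library; `has_free_space` is a hypothetical helper, and treating 1.71 TB as 1.71 × 10^12 bytes is an assumption about the units used above.

```python
import shutil

# Assumption: "1.71TB" is interpreted in decimal units (10**12 bytes per TB).
FRAME_LEVEL_BYTES = int(1.71e12)


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if not has_free_space(".", FRAME_LEVEL_BYTES):
    print("Not enough free space here for the full frame-level dataset.")
```

Run it from the directory you intend to download into, since the script stores shards in the current directory.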

Video-level features dataset

Video-level features are stored as tensorflow.Example protocol buffers. A tensorflow.Example proto is reproduced here in text format:
features: {
  feature: {
    key  : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172] # The meaning of the labels can be found here.
      }
    }
  }
  feature: {
    key  : "mean_rgb" # Average of all 'rgb' features for the video
    value: {
      float_list: {
        value: [1024 float features]
      }
    }
  }
  feature: {
    key  : "mean_audio" # Average of all 'audio' features for the video
    value: {
      float_list: {
        value: [128 float features]
      }
    }
  }
}
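The relationship between the per-frame features and the mean_rgb / mean_audio fields can be sketched as a per-dimension average. This is only an illustration: `mean_feature` is a hypothetical helper, it averages the raw 8-bit values directly, and the mapping from quantized bytes back to floats is not specified on this page, so the dequantization step is omitted.

```python
def mean_feature(frames):
    """Average equal-length per-frame byte strings, dimension by dimension.

    Assumption: operates on raw 8-bit values; any dequantization applied in
    the real dataset is not reproduced here.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same length")
    totals = [0] * dim
    for frame in frames:
        # Iterating over a bytes object yields integers in the range 0..255.
        for i, value in enumerate(frame):
            totals[i] += value
    return [total / len(frames) for total in totals]
```

For the "rgb" feature list each frame would hold 1024 bytes, and for "audio" 128 bytes, giving averaged vectors of the same lengths.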
The total size of the video-level features is 31 gigabytes. They are split into 4096 shards, which can be subsampled to reduce the dataset size. As with the frame-level features, we offer two download options:
  • Manually download all 4096 shards from the video-level training, video-level validation, and video-level test partitions. You may also find it useful to download a handful of shards first and start developing your code against them while the larger download runs in parallel.
    If you are located in Europe or Asia, please replace us in the URL with eu or asia, respectively, to speed up the transfer of the files.
  • Use our Python download script. For example:
    mkdir -p ~/data/yt8m_video_level; cd ~/data/yt8m_video_level

    curl data.yt8m.org/download.py | partition=1/video_level/train mirror=us python
    curl data.yt8m.org/download.py | partition=1/video_level/validate mirror=us python
    curl data.yt8m.org/download.py | partition=1/video_level/test mirror=us python
    If you are located in Europe or Asia, please swap the mirror flag us with eu or asia, respectively.

    To download 1/100th of the training data from the US mirror, use:
    curl data.yt8m.org/download.py | shard=1,100 partition=1/video_level/train mirror=us python
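Since the commands differ only in feature level, partition, and mirror, they can be generated rather than typed. The sketch below builds the same command strings shown above; `download_commands` is a hypothetical helper, and it only formats text without running anything.

```python
VALID_LEVELS = ("frame_level", "video_level")
VALID_MIRRORS = ("us", "eu", "asia")


def download_commands(level, mirror="us", partitions=("train", "validate", "test")):
    """Build the shell commands shown above for one feature level and mirror."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}")
    if mirror not in VALID_MIRRORS:
        raise ValueError(f"mirror must be one of {VALID_MIRRORS}")
    return [
        f"curl data.yt8m.org/download.py | partition=1/{level}/{part} mirror={mirror} python"
        for part in partitions
    ]


for command in download_commands("video_level", mirror="eu"):
    print(command)
```

The printed lines can be pasted into a terminal in the directory where the shards should be stored.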