
CVPR'17 Workshop on YouTube-8M Large-Scale Video Understanding


Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes, and has significantly accelerated research in image understanding. Google recently announced the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, and we hope it will spur similar innovation and advancement in video understanding. YouTube-8M represents a cross-section of our society, and was designed with scale and diversity in mind so that lessons we learn on this dataset can transfer to all areas of our lives, from learning, to communication, to entertainment. It covers over 20 broad domains of video content, including entertainment, sports, commerce, hobbies, science, news, jobs & education, health.

We are excited to announce the CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. We invite researchers to participate in a large-scale video classification challenge and to report their results at this workshop, as well as to submit papers describing research, experiments, or applications based on YouTube-8M. The classification challenge will be hosted as a kaggle.com competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering limited compute credits so participants can optionally do model training and exploration using the Google Cloud Machine Learning platform (this is for the convenience of participants and not a requirement for participation).


Time Content Presenter
9:00 - 9:05 Opening Remarks Paul Natsev
9:05 - 9:30 Overview of YouTube-8M Dataset, Challenge Challenge Organizers
Session 1
9:30 - 10:00 Invited Talk 1: Video understanding: what we understood and what we still need to learn Alex Hauptmann
10:00 - 10:30 Invited Talk 2: Structured Models for Human Action Recognition Cordelia Schmid
10:30 - 10:45 Coffee Break
Session 2
10:45 - 12:00 Oral Session 1
  • Aggregating Frame-level Features for Large-Scale Video Classification (Team FDT, #4)
  • The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge (Team monkeytyping, #2)
  • Deep Learning Methods for Efficient Large Scale Video Labeling (Team You8M, #5)
  • Learnable pooling with Context Gating for video classification (Team WILLOW, #1)
  • Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding (Team offline, #3)
12:00 - 1:00 Lunch on your own
Session 3
1:00 - 1:30 Invited Talk 3: Learning from Synthetic Humans Ivan Laptev
1:30 - 2:00 Invited Talk 4: Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos Mubarak Shah
2:00 - 2:30 YouTube-8M Classification Challenge Summary, Organizers' Lightning Talks Challenge Organizers
2:30 - 3:30 Poster Session Participants
3:30 - 3:45 Coffee Break
Session 4
3:45 - 5:00 Oral Session 2
  • Learning Features by Watching Objects Move (General Paper Track)
  • Cultivating DNN Diversity for Large Scale Video Labelling (Team Yeti, #7)
  • Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text (Team DL2.0, #22)
  • Encoding Video and Label Priors for Multi-label Video Classification on YouTube-8M dataset (Team SNUVL X SKT, #8)
  • The YouTube-8M Kaggle Competition: Challenges and Methods (Team Samaritan, #10)
5:00 - 5:20 Closing and Award Ceremony Paul Natsev

Accepted Papers

Classification Challenge Track

  • Learnable pooling with Context Gating for video classification [pdf] [ArXiv] [slide]
    Antoine Miech, Ivan Laptev, Josef Sivic (Team WILLOW, ranked at 1)
  • The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge [pdf] [ArXiv] [slide]
    He-Da Wang, Teng Zhang, Ji Wu (Team monkeytyping, ranked at 2)
  • Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding [pdf] [ArXiv]
    Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie Zhou, Shilei Wen (Team offline, ranked at 3)
  • Aggregating Frame-level Features for Large-Scale Video Classification [pdf] [slide]
    Shaoxiang Chen, Xi Wang, Yongyi Tang, Xinpeng Chen, Zuxuan Wu, Yu-Gang Jiang (Team FDT, ranked at 4)
  • Deep Learning Methods for Efficient Large Scale Video Labeling [pdf] [ArXiv] [slide]
    Miha Skalic, Marcin Pekalski, Xingguo E. Pan (Team You8M, ranked at 5)
  • UTS submission to Google YouTube-8M Challenge 2017 [pdf]
    Linchao Zhu, Yanbin Liu, Yi Yang (Team Rankyou, ranked at 6)
  • Cultivating DNN Diversity for Large Scale Video Labelling [pdf]
    Mikel Bober-Irizar, Sameed Husain, Eng-Jon Ong, Miroslaw Bober (Team Yeti, ranked at 7)
  • Encoding Video and Label Priors for Multi-label Video Classification on YouTube-8M dataset [pdf] [ArXiv] [slide]
    Seil Na, Jisung Kim, YoungJae Yu, Sangho Lee, Gunhee Kim (Team SNUVL X SKT, ranked at 8)
  • The YouTube-8M Kaggle Competition: Challenges and Methods [pdf] [ArXiv] [slide]
    Haosheng Zou, Kun Xu, Jialian Li (Team Samaritan, ranked at 10)
  • Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text [pdf]
    Zhe Wang, Kingsley Kuan, Mathieu Ravaut, Gaurav Manek, Sibo Song, Fang Yuan, Kim Seokhwan, Nancy Chen, Luis Fernando D'Haro Enriquez, Luu Anh Tuan, Hongyuan Zhu, Zeng Zeng, Ngai Man Cheung, Georgios Piliouras, Jie Lin, Vijay Chandrasekhar (Team DL2.0, ranked at 22)
  • YouTube-8M Video Understanding Challenge Approach and Applications [pdf] [ArXiv]
    Edward Chen (Team Xers, ranked at 47)
  • Large-scale Video Classification guided by Batch Normalized LSTM Translator [pdf] [ArXiv]
    Jae Hyeon Yoo (Team J, ranked at 206)
  • Large-Scale YouTube-8M Video Understanding with Deep Neural Networks [pdf] [ArXiv]
    Manuk Akopyan, Eshsou Khashba (Team МанукАкопян, ranked at 259)
  • Video Representation Learning and Latent Concept Mining for Large-scale Multi-label Video Classification [pdf] [ArXiv]
    Po-Yao Huang, Ye Yuan, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann (Team Informedia lab)
  • Hierarchical Deep Recurrent Architecture for Video Understanding [pdf] [ArXiv]
    Luming Tang, Boyang Deng, Haiyu Zhao, Shuai Yi (Team Never Lucky)
  • An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform [pdf] [ArXiv]
    Zhenzhen Zhong, Shujiao Huang, Cheng Zhan, Licheng Zhang, Zhiwei Xiao, Chang-Chun Wang, Pei Yang (Team Cheng Zhan)

General Paper Track

  • Learning Features by Watching Objects Move [pdf] [ArXiv]
    Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan
  • Hierarchical Label Inference for Video Classification [pdf]
    Nelson Nauata, Jonathan Smith, Greg Mori

Invited Talks

Invited Talk 1: Video understanding: what we understood and what we still need to learn
Alex Hauptmann Carnegie Mellon University

The talk will reflect on our work on representation learning with different network types. I will also discuss insights into the role of accuracy of labels in limiting learning, with respect to latent concept learning, and self-paced learning.
In the talk I hope to analyze our partially failed attempts at using temporal information, concluding with an overview of our work on YouTube-8M. Beyond these lessons, the remainder of the talk will look at a bigger picture, beyond labels for images, with the conclusion that classification is not the same as understanding. Neither is recognition. Captioning is also not understanding. I will try to define the components of "deeper" understanding and alternative approaches to research in video.

Invited Talk 2: Structured models for human action recognition
Cordelia Schmid INRIA Research

In this talk, we present some recent results for human action recognition in videos. We first introduce a pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. We also present an approach for extracting such human pose in 2D and 3D. Next, we propose an approach for spatio-temporal action localization, which detects and scores CNN action proposals at a frame as well as at a tubelet level and then tracks high-scoring proposals in the video. Actions are localized in time with an LSTM at the track level. Finally, we show how to extend this type of method to weakly supervised learning of actions, which allows scaling to large amounts of data without manual annotation.

Invited Talk 3: Learning from Synthetic Humans
Ivan Laptev INRIA Research

Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Invited Talk 4: Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos
Mubarak Shah University of Central Florida

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has been limited due to the complexity of video data and the lack of annotations. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal generation and association of proposals across frames. Also, most of these methods employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified deep network that is able to recognize and localize action based on 3D convolution features. A video is first divided into equal length clips, and next, for each clip, a set of tube proposals is generated based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together employing network flow, and spatio-temporal action detection is performed using these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN for classifying and localizing actions in both trimmed and untrimmed videos compared to state-of-the-art methods.

Call for Participation

We are soliciting participation for two different tracks:

Classification Challenge Track

This track will be organized as a Kaggle competition for large-scale video classification based on the YouTube-8M dataset. Researchers are invited to participate in the classification challenge by training a model on the public YouTube-8M training and validation sets and submitting video classification results on a blind test set. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle. Results will be scored by a Kaggle evaluation server and published on a public leaderboard, updated live for all submissions (scored on a portion of the test set), along with a final (private) leaderboard, published after the competition is closed (scored on the rest of the test set). Top-ranking submissions on the challenge leaderboard will be invited to the workshop to present their method as an oral talk. Please see details on the Kaggle competition page.
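As a rough illustration of the public/private leaderboard idea described above, the following Python sketch splits a blind test set into two disjoint parts and scores a submission on each. Everything here is an assumption made for illustration: the function names, the 50% split, the random partitioning, and the simple single-label accuracy are hypothetical and are not the competition's actual split or evaluation metric, which are defined on the Kaggle competition page.

```python
import random


def split_leaderboard(video_ids, public_fraction, seed=0):
    """Partition test video ids into disjoint public and private subsets.

    Hypothetical helper: the real split used by the evaluation server
    is not described on this page.
    """
    ids = sorted(video_ids)          # fixed order so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])


def accuracy(predictions, labels, subset):
    """Placeholder metric: share of videos in `subset` predicted correctly.

    This is NOT the challenge metric; it only stands in for "a score".
    """
    if not subset:
        raise ValueError("subset must not be empty")
    hits = sum(1 for vid in subset if predictions[vid] == labels[vid])
    return hits / len(subset)


# Toy data: ten test videos with made-up labels and one wrong prediction.
labels = {f"vid{i}": i % 3 for i in range(10)}
predictions = dict(labels)
predictions["vid0"] = 2  # deliberately incorrect

public_ids, private_ids = split_leaderboard(labels, public_fraction=0.5)

# The public score is visible during the competition; the private score
# would be revealed only after the competition closes.
print("public score:", accuracy(predictions, labels, public_ids))
print("private score:", accuracy(predictions, labels, private_ids))
```

Because the two subsets are disjoint, a submission tuned heavily against the live public score can still rank differently on the final private leaderboard.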

We encourage participants to explore the following topics (non-exhaustive list) and to submit papers to this workshop discussing their approaches and result analysis (publication is also a requirement for prize eligibility on the Kaggle competition):

  • large-scale multi-label video classification / annotation
  • temporal / sequence modeling and pooling approaches for video
  • temporal attention modeling mechanisms
  • video representation learning (e.g., classification performance vs. video descriptor size)
  • multi-modal (audio-visual) modeling and fusion approaches
  • learning from noisy / incomplete ground-truth labels
  • score calibration and ranking across classes and videos
  • multiple-instance learning (training frame-/segment-level models from video labels)
  • transfer learning, domain adaptation, generalization (across the 24 top-level verticals)
  • scale: performance vs. training data & compute quantity (#videos, #frames, #CPUs, etc.)

General Paper Track

Researchers are invited to submit any papers involving research, experimentation, or applications on the YouTube-8M dataset. Paper submissions will be reviewed by the workshop organizers and accepted papers will be invited for oral or poster presentations at the workshop.

We encourage participants to explore any relevant topics of interest using the YouTube-8M dataset, including but not limited to:

  • All of the topics listed above (with or without participation in the Kaggle challenge)
  • Large-scale video recommendations, search, and discovery
  • Joining YouTube-8M with other publicly available resources / datasets (e.g., exploring YouTube video metadata, testing public models / features on this dataset, etc.)
  • Dataset visualization / browsing interfaces
  • Label augmentation and cleanup, active learning
  • Open-source contributions to the community targeting this dataset

Submission to this track does not require participation in the challenge task, but must be related to the YouTube-8M dataset. We welcome new applications that we didn't think of! Paper submissions are expected to have 4 to 8 pages (no strict page limit) in the CVPR formatting style. Demo paper submissions are also welcome.


Google Cloud sponsors awards for the top-performing challenge participants, who agree to:

  1. Describe their challenge approach(es) in a paper submission at this workshop, and
  2. Open-source an implementation of the above (both training and inference code) that can reproduce their best results (within a reasonable margin).

Note that publication and open-sourcing are not required to participate in the challenge---we welcome all participation, and will score and rank all submissions, regardless of how they are generated, or whether they are published. However, only submissions that meet the above requirements will be eligible for award recognition and cash prizes.

The total prize pool for this competition is $100,000. For more details on prizes and eligibility, refer to the Kaggle competition pages.

Congratulations to the winners!

The 1st place: Team WILLOW

The 2nd place: Team monkeytyping

The 4th place: Team FDT

The 5th place: Team You8M

Submission Instructions


All submissions will be handled electronically; we request a publicly available URL, where we can access the paper. We recommend uploading your paper on arXiv, but other paper hosting arrangements are acceptable (e.g., technical report at your institution, your own website, etc.). There is no strict limit on the number of pages---we recommend 4 to 8 pages, in the CVPR formatting style. Supplementary material will not be reviewed or considered. Please refer to the files in the Author Guidelines page at the CVPR 2017 website for formatting instructions.

Review Process

Submitted papers will be reviewed by the organizing committee members, and a subset will be selected for oral or poster presentation. Submissions will be evaluated in terms of potential impact (e.g. performance on the classification challenge), technical depth & scalability, novelty, and presentation.

Blind Submission / Dual Submission / Page Limit Policies

We do not require blind submissions---author names and affiliations may be shown. We do not restrict submissions of relevant work that is under review or will be published elsewhere. Previously published work is also acceptable as long as it is retargeted towards YouTube-8M. There is no strict page limit but we encourage 4 to 8 page submissions. The accepted papers will be linked on the workshop website and will not appear in the official CVPR proceedings.

How to Submit

  1. Fill out this form.
  2. Submission deadline for this form is June 16, 2017, 11:59 PM (UTC/GMT).

Important Dates

Challenge Submissions Deadline: June 2, 2017
Paper Submission and Open-Sourcing Deadline: June 28, 2017 (Extended)
Paper Acceptance & Awards Notification: June 30, 2017
Paper Camera-Ready Deadline: July 14, 2017
Workshop date (co-located with CVPR'17): July 26, 2017

All deadlines are at 11:59 PM UTC/GMT.



If you have any questions, please email us at yt8m-challenge@google.com or use the YouTube-8M Users Group.