The 2nd Workshop on YouTube-8M Large-Scale Video Understanding

September 9th, 2018, Munich, Germany (ECCV'18)

Introduction

Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets such as ImageNet, which has millions of images labeled with thousands of classes and has significantly accelerated research in image understanding. Google announced the YouTube-8M dataset in 2016, which spans millions of videos labeled with thousands of classes, with the hope that it would spur similar innovation and advancement in video understanding. YouTube-8M represents a cross-section of our society, and was designed with scale and diversity in mind so that lessons we learn on this dataset can transfer to all areas of our lives, from learning, to communication, to entertainment. It covers over 20 broad domains of video content, including entertainment, sports, commerce, hobbies, science, news, jobs & education, and health.

Continuing from last year's challenge and workshop, we are excited to announce the 2nd Workshop on YouTube-8M Large-Scale Video Understanding, to be held on September 9, 2018, at the European Conference on Computer Vision (ECCV 2018) in Munich, Germany. We invite researchers to participate in this large-scale video classification challenge and to report their results at the workshop, as well as to submit papers describing research, experiments, or applications based on YouTube-8M. The classification challenge will be hosted as a kaggle.com competition. We will offer a $5,000 travel award to each of the 5 top-performing teams (see the Awards section).

Program

Time	Content	Presenter
9:00 - 9:05	Opening Remarks	Paul Natsev
9:05 - 9:30	Overview of 2018 YouTube-8M Dataset & Challenge	Challenge Organizers
Session 1
9:30 - 10:00	Invited Talk 1: Human action recognition and the Kinetics dataset	Andrew Zisserman
10:00 - 10:30	Invited Talk 2: Segmental Spatio-Temporal Networks for Discovering the Language of Surgery	Rene Vidal
10:30 - 10:45	Coffee Break
Session 2
10:45 - 12:00	Oral Session 1
	Building a Size Constrained Predictive Model for Video Classification	Next top GB model (#1)
	Temporal Attention Mechanism with Conditional Inference for Large-Scale Multi-Label Video Classification	KANU (#5)
	Label Denoising with Large Ensembles of Heterogeneous Neural Networks	Samsung AI Moscow (#2)
	NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification	PhoenixLin (#3)
	Non-local NetVLAD Encoding for Video Classification	YT8M-T (#4)
12:00 - 1:00	Lunch on your own
Session 3
1:00 - 1:30	Invited Talk 3: Learning video representations for physical interactions and language-based retrieval	Josef Sivic
1:30 - 2:00	Invited Talk 4: Towards Video Understanding at Scale	Manohar Paluri
2:00 - 2:15	Context-Gated DBoF Models for YouTube-8M [slide]	Paul Natsev
2:15 - 3:45	Poster Session	Participants
3:45 - 4:00	Coffee Break
Session 4
4:00 - 4:45	Oral Session 2
	Learnable Pooling Methods for Video Classification	Deep Topology
	Training Compact Deep Learning Models for Video Classification using Circulant Matrices	Alexandre Araujo (#36)
	Axon AI's Solution to the 2nd YouTube-8M Video Understanding Challenge	Axon AI (#17)
4:45 - 5:00	Closing and Award Ceremony	Paul Natsev

Accepted Papers

Summary Paper by Organizers

The 2nd YouTube-8M Large-Scale Video Understanding Challenge [pdf] [slide]
Joonseok Lee, Apostol (Paul) Natsev, Walter Reade, Rahul Sukthankar, George Toderici

Classification Challenge Track

Building a Size Constrained Predictive Model for Video Classification [pdf] [slide]
Miha Skalic, David Austin (Team Next top GB model, ranked at 1)
Label Denoising with Large Ensembles of Heterogeneous Neural Networks [pdf] [slide]
Vladimir Aliev, Pavel Ostyakov, Roman Suvorov, Gleb Sterkin, Elizaveta Logacheva, Oleg Khomenko, Sergey Nikolenko (Team Samsung AI Center Moscow, ranked at 2)
NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification [pdf] [slide]
Rongcheng Lin, Jing Xiao, Jianping Fan (Team PhoenixLin, ranked at 3)
Non-local NetVLAD Encoding for Video Classification [pdf] [slide]
Yongyi Tang, Xing Zhang, Jingwen Wang, Shaoxiang Chen, Lin Ma, Yu-Gang Jiang (Team YT8M-T, ranked at 4)
Temporal Attention Mechanism with Conditional Inference for Large-Scale Multi-Label Video Classification [pdf] [slide]
Eun-Sol Kim, Jongseok Kim, Kyoung-Woon On, Yu-Jung Heo, Seong-Ho Choi, Hyun-Dong Lee, Byoung-Tak Zhang (Team KANU, ranked at 5)
Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge [pdf]
Tianqi Liu, Bo Liu (Team Liu, ranked at 7)
Axon AI's Solution to the 2nd YouTube-8M Video Understanding Challenge [pdf] [slide]
Choongyeun Cho, Benjamin Antin, Sanchit Arora, Shwan Ashrafi, Peilin Duan, Dang The Huynh, Lee James, Hang Tuan Nguyen, Moji Solgi, Cuong Van Than (Team Axon AI, ranked at 17)
Training Compact Deep Learning Models for Video Classification using Circulant Matrices [pdf] [slide]
Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, Jamal Atif (Team Alexandre Araujo, ranked at 36)
Learning Video Features for Multi-Label Classification [pdf]
Shivam Garg (Team ShivamGarg, ranked at 83)
Learnable Pooling Methods for Video Classification [pdf]
Sebastian Kmiec, Juhan Bae (Team Deep Topology, disqualified from rank 38)
Approach for Video Classification with Multi-label on YouTube-8M Dataset [pdf]
Kwangsoo Shin, Junhyeong Jeon, Seungbin Lee (Team sogang-mm, disqualified from rank 44)

General Paper Track

Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network [pdf]
Feng Mao, Xiang Wu, Hui Xue, Rong Zhang
Toward Good Practices for Multi-modal Fusion in Large-scale Video Classification [pdf]
Jinlai Liu, Zehuan Yuan, Changhu Wang

Invited Talks

Invited Talk 1: Human action recognition and the Kinetics dataset [slide]
Andrew Zisserman Oxford University & Google DeepMind

The Kinetics dataset aims to be the "ImageNet" for human action recognition: a dataset that can be used to compare the performance of network architectures for action classification, and to pre-train networks for use on other tasks. This talk will cover three topics related to Kinetics: first, an illustrated review of the recently released Kinetics-600 dataset; second, demonstrations of its use in pre-training networks for other tasks, including temporal localization of actions in the AVA and Charades challenges; and third, an assessment of progress in network design for human action classification, in particular the importance or otherwise of temporal information in classifying actions.

Invited Talk 2: Segmental Spatio-Temporal Networks for Discovering the Language of Surgery
Rene Vidal Johns Hopkins University

Robotic Minimally Invasive Surgery (RMIS) has several advantages over traditional surgery, such as better precision, smaller incisions and reduced recovery time. However, the steep learning curve together with the lack of fair, objective, and effective criteria for judging the skills acquired by a trainee may reduce its benefits. Computer vision and machine learning techniques offer a unique opportunity for using RMIS recordings to model surgeon expertise and perform automatic skill assessment, gesture segmentation and classification. This talk will present dynamical system, conditional random field and deep learning based methods for decomposing kinematic and video data of a surgical task into a series of pre-defined surgical gestures, such as "insert a needle", "grab a needle", "position a needle", etc. and classifying the performance of a surgeon as "novice" or "expert".

Invited Talk 3: Learning video representations for physical interactions and language-based retrieval
Josef Sivic INRIA & Czech Technical University

In this talk I will describe our two recent works on learning video representations. First, I will outline an approach to estimate the 3D trajectory of a person manipulating an object given a single unconstrained video as input. This is achieved by visually recognizing the contacts between the person and the manipulated object and then estimating the 3D trajectory of the person and the object under the contact and kinematic constraints. Results will be shown on unconstrained Internet instructional videos depicting challenging person-object interactions. Second, I will describe a Mixture-of-Embedding-Experts model for learning joint text-video embeddings from heterogeneous datasets comprising videos, still images and audio. The proposed model outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks on the MPII Movie Description and MSR-VTT datasets.

Invited Talk 4: Towards Video Understanding at Scale
Manohar Paluri Facebook

In this talk, I will touch on various challenges we face in building video understanding technology at Facebook scale and highlight the progress we have made over the past few years in three areas: developing models that do better sequential reasoning; building datasets that define video understanding more concretely and/or serve as good proxies; and moving towards weakly supervised and self-supervised techniques while bridging the gap between them and fully supervised settings.

Call for Participation

We are soliciting participation for two different tracks:

Classification Challenge Track

This track will be organized as a Kaggle competition for large-scale video classification based on the YouTube-8M dataset. Researchers are invited to participate in the classification challenge by training a model on the public YouTube-8M training and validation sets and submitting video classification results on a blind test set. Unlike last year, you are challenged to produce a compact video classification model. Your model size must not exceed 1 GB (this is strictly enforced through model upload). In addition, you are encouraged to use a small bottleneck layer. For example, can you encode the semantic labels by passing through 100 bytes per video? This is not strictly enforced, but it is desired and, if followed, should be clearly noted in the paper submission. Open-source TensorFlow code implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle. Results will be scored by a Kaggle evaluation server and published on a public leaderboard, updated live for all submissions (scored on a portion of the test set), along with a final (private) leaderboard, published after the competition is closed (scored on the rest of the test set). Top-ranking submissions in the challenge leaderboard will be invited to the workshop to present their method. Please see details on the Kaggle competition page.
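As a rough illustration of the two constraints in this track (the enforced 1 GB model size limit and the optional 100-byte bottleneck), the following Python sketch may help with a local sanity check before uploading. It is not part of the official starter code. All function names are hypothetical, reading "1 GB" as 10**9 bytes is an assumption (the competition rules are authoritative), and storing each bottleneck activation as one unsigned byte is only one possible encoding.

```python
import os

# Assumption: "1 GB" is interpreted as 10**9 bytes; check the competition rules.
SIZE_LIMIT_BYTES = 10**9
# The call suggests passing 100 bytes per video through the bottleneck.
BOTTLENECK_BYTES = 100


def model_size_bytes(model_dir):
    """Sum the sizes of all files under model_dir (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def within_size_limit(model_dir, limit=SIZE_LIMIT_BYTES):
    """Return True if the exported model does not exceed the size limit."""
    return model_size_bytes(model_dir) <= limit


def pack_bottleneck(activations):
    """Quantize 100 activations in [0, 1] to one byte each (100 bytes total).

    Assumes the bottleneck layer has exactly 100 units with sigmoid-like
    outputs; values outside [0, 1] are clipped before quantization.
    """
    if len(activations) != BOTTLENECK_BYTES:
        raise ValueError("expected exactly %d activations" % BOTTLENECK_BYTES)
    return bytes(round(min(max(a, 0.0), 1.0) * 255) for a in activations)


def unpack_bottleneck(payload):
    """Map each stored byte back to an approximate activation in [0, 1]."""
    return [b / 255 for b in payload]
```

A classifier trained on the output of `unpack_bottleneck` would then see exactly the information that fits in 100 bytes per video. Other encodings, such as 25 32-bit floats or 800 binary units, fit the same budget.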

We encourage participants to explore the following topics (non-exhaustive list) and to submit papers to this workshop discussing their approaches and result analysis (publication is also a requirement for prize eligibility on the Kaggle competition):

large-scale multi-label video classification / annotation
temporal / sequence modeling and pooling approaches for video
temporal attention modeling mechanisms
video representation learning (e.g., classification performance vs. video descriptor size)
multi-modal (audio-visual) modeling and fusion approaches
learning from noisy / incomplete ground-truth labels
score calibration and ranking across classes and videos
multiple-instance learning (training frame-/segment-level models from video labels)
transfer learning, domain adaptation, generalization (across the 24 top-level verticals)
scale: performance vs. training data & compute quantity (#videos, #frames, #CPUs, etc.)

General Paper Track

Researchers are invited to submit any papers involving research, experimentation, or applications on the YouTube-8M dataset. Paper submissions will be reviewed by the workshop organizers and accepted papers will be invited for oral or poster presentations at the workshop.

We encourage participants to explore any relevant topics of interest using the YouTube-8M dataset, including but not limited to:

All of the topics listed above (with or without participation in the Kaggle challenge)
Large-scale video recommendations, search, and discovery
Joining YouTube-8M with other publicly available resources / datasets (e.g., exploring YouTube video metadata, testing public models / features on this dataset, etc.)
Dataset visualization / browsing interfaces
Label augmentation and cleanup, active learning
Open-source contributions to the community targeting this dataset

Submission to this track does not require participation in the challenge task, but must be related to the YouTube-8M dataset. We welcome new applications that we didn't think of! Paper submissions are expected to have 8 to 12 pages (no strict page limit) in the ECCV formatting style. Demo paper submissions are also welcome.

Awards

Each of the top 5 ranked teams (on the final private leaderboard) will receive $5,000 per team as a travel award to attend the ECCV 2018 Conference. Prize eligibility requires adherence to the Competition Rules. Winners are the top 5 teams on the private leaderboard with verified model size less than 1GB.

Congratulations to the winners!

Next top GB model (ranked at 1)
Samsung AI Center Moscow (ranked at 2)
PhoenixLin (ranked at 3)
YT8M-T (ranked at 4)
KANU (ranked at 5)

Teams that submitted non-eligible models (e.g., broken or over-sized) were either removed or moved to the position matching the score of their best compliant model. The top 5 teams are the final awardees with verified models.

Teams that submitted a non-eligible model (>1GB) are still listed in the final leaderboard, but are not eligible for an award.
Teams without a model submission are removed from the final leaderboard, but we provide their original rank and private leaderboard score for the record.
Teams that violated any rule (e.g., multiple accounts) are removed from the leaderboard as well as from the list above.

The 1st place: Team Next top GB model	The 2nd place: Team Samsung AI Center Moscow
The 4th place: Team YT8M-T	The 5th place: Team KANU

Submission Instructions

Formatting

All submissions will be handled electronically, through our CMT submission site. There is no strict limit on the number of pages; we recommend 8 to 12 pages, in the ECCV formatting style. Supplementary material will not be reviewed or considered. Please refer to the files in the Author Guidelines page at the ECCV 2018 website for formatting instructions.

Review Process

Submitted papers will be reviewed by the organizing committee members, and a subset will be selected for oral or poster presentation. Submissions will be evaluated in terms of potential impact (e.g. performance on the classification challenge), technical depth & scalability, novelty, and presentation.

Blind Submission / Dual Submission / Page Limit Policies

We do not require blind submissions---author names and affiliations may be shown. We do not restrict submissions of relevant work that is under review or will be published elsewhere. Previously published work is also acceptable as long as it is retargeted towards YouTube-8M. There is no strict page limit but we encourage 8 to 12 page submissions. The accepted papers will be linked on the workshop website and will appear in the official ECCV proceedings.

How to Submit

Create an account at our CMT submission site. If you do not receive a confirmation email, you may reset your password before logging in.
Submit your paper through the CMT site. You will have to enter your Kaggle team name if you are submitting to the classification challenge track.
Submission deadline for this form is August 13, 2018, 11:59 PM (UTC/GMT).

Important Dates

Challenge submission (model upload) deadline	August 6, 2018
Paper submission deadline & Winners' obligations deadline	August 20, 2018 (Extended)
Paper Acceptance Notification & Challenge Winners Confirmation	August 22, 2018
Workshop date (co-located with ECCV'18)	September 9, 2018
Paper camera-ready deadline	September 30, 2018