The 3rd Workshop on YouTube-8M Large-Scale Video Understanding

October 28th, 2019, Seoul, Korea (ICCV'19)


Introduction

Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes, and has significantly accelerated research in image understanding. Google announced the YouTube-8M dataset in 2016, which spans millions of videos labeled with thousands of classes, with the hope that it would spur similar innovation and advancement in video understanding. YouTube-8M represents a cross-section of our society, and was designed with scale and diversity in mind so that lessons we learn on this dataset can transfer to all areas of our lives, from learning, to communication, to entertainment. It covers over 20 broad domains of video content, including entertainment, sports, commerce, hobbies, science, news, jobs & education, and health.

Continuing from last year's challenge and workshop, we are excited to announce the 3rd Workshop on YouTube-8M Large-Scale Video Understanding, to be held on October 28, 2019, at the International Conference on Computer Vision (ICCV 2019) in Seoul, Korea. We invite researchers to participate in this large-scale video classification challenge and to report their results at the workshop, as well as to submit papers describing research, experiments, or applications based on YouTube-8M. The classification challenge will be hosted as a kaggle.com competition. We will offer a $2,500 travel award to each of the 10 top-performing teams (details here).

Program

Time	Content	Presenter
9:00 - 9:05	Opening Remarks	Paul Natsev
9:05 - 9:20	Overview of 2019 YouTube-8M Dataset & Challenge	Challenge Organizers
Session 1
9:20 - 9:50	Invited Talk 1: Action Recognition and Prediction in Spacetime	Jitendra Malik
9:50 - 10:20	Invited Talk 2: Learning from Narrated Videos	Jean-Baptiste Alayrac
10:20 - 10:40	Coffee Break
Session 2
10:40 - 11:00	MediaPipe: A framework for building perception pipelines	Chris McClanahan
11:00 - 12:00	Oral Session 1: Logistic Regression is Still Alive and Effective: The 3rd YouTube 8M Challenge Solution of the IVUL-KAUST team; Multi-attention Networks for Temporal Localization of Video-level Labels; Soft-Label: A Strategy to Expand Dataset for Large-scale Fine-grained Video Classification	IVUL-KAUST (#11); Locust (#13); opsz (#10)
12:00 - 2:00	Lunch on your own
Session 3
2:00 - 2:30	Invited Talk 3: Detecting Activities with Less	Cees Snoek
2:30 - 3:00	Invited Talk 4: From video-level to fine-grained recognition and retrieval of interactions	Dima Damen
3:00 - 4:00	Oral Session 2: MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization; Cross-Class Relevance Learning for Information Fusion in Temporal Concept Localization; Noise Learning for Weakly Supervised Segment Classification in Video	RLin (#3); Layer6 AI (#1); zhangzhaoyu (#8)
4:00 - 4:30	Coffee Break
Session 4
4:30 - 6:00	Poster Session	All accepted papers

Accepted Papers

Introduction by Organizers

The 3rd YouTube-8M Large-Scale Video Understanding Challenge, MediaPipe [slide]
Joonseok Lee, Chris McClanahan

Classification Challenge Track

Cross-Class Relevance Learning for Information Fusion in Temporal Concept Localization [pdf]
Junwei Ma, Satya Krishna Gorti, Maksims Volkovs, Ilya Stanevich, Guangwei Yu (Layer6 AI, ranked at 1)
Exploring the Consistency of Segment-level and Video-level Predictions for Improved Temporal Concept Localization in Videos [pdf]
Zejia Weng, Rui Wang, Yu-Gang Jiang (BigVid Lab, ranked at 2)
MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization [pdf] [slide]
Rongcheng Lin, Jing Xiao, Jianping Fan (RLin, ranked at 3)
A segment-level classification solution to the 3rd YouTube-8M Video Understanding Challenge [pdf]
Shubin Dai (bestfitting, ranked at 4)
Towards Localizing Temporal Events in Large-scale Video Data [pdf]
Miha Skalic, Mikel Bober-Irizar, David Austin (Last Top GB Model, ranked at 5)
Weakly Supervised EM Process For Temporal Localization Within Video [pdf]
Jin Xia, Jie Shao, Cewu Lu, Changhu Wang (ByteVideo, ranked at 6)
Temporal Concept Localization within Video using a Mixture of Context-Aware and Context-Agnostic Segment Classifiers [pdf]
Shih-hsuan Lee (Ceshine, ranked at 7)
Noise Learning for Weakly Supervised Segment Classification in Video [pdf]
Zhaoyu Zhang, Xiang Wu, Jianfeng Dong, Yuan He, Hui Xue, Feng Mao (zhangzhaoyu, ranked at 8)
BERT for Large-scale Video Segment Classification with Test-time Augmentation [pdf]
Tianqi Liu, Qizhan Shao (TM, ranked at 9)
Soft-Label: A Strategy to Expand Dataset for Large-scale Fine-grained Video Classification [pdf]
Han Kong, Yubin Wu, Kang Yin, Feng Guo, Huaiqin Dong, Yulu Wang (opsz, ranked at 10)
Logistic Regression is Still Alive and Effective: The 3rd YouTube 8M Challenge Solution of the IVUL-KAUST team [pdf]
Merey Ramazanova, Chen Zhao, Humam Alwassel, Mengmeng Frost Xu, Sara Rojas Martinez, Bernard Ghanem, Fabian Caba (IVUL-KAUST, ranked at 11)
Boosting Up Segment-level Video Classification Performance with Label Correlation and Reweighting [pdf]
Wang Lei (UnitedAi, ranked at 12)
Multi-attention Networks for Temporal Localization of Video-level Labels [pdf] [slide]
Lijun Zhang, Srinath Nizampatnam, Ahana Gangopadhyay, Marcos Conde (Locust, ranked at 13)
A Temporal Concept Localization Method Integrating Attention Mechanism [pdf]
Jupan Li, Borui Li, Zhengdong Li, Guanglei Zhang, Jinchao Xia (eHualu, ranked at 27)
Feed Forward Neural Network Architectures for Temporal Concept Localization [pdf]
Ramana Anandakumar, Jeya Anandakumar, Megala N. (rand, ranked at 95)

Invited Talks

Invited Talk 1: Action Recognition and Prediction in Spacetime
Jitendra Malik, University of California at Berkeley

Invited Talk 2: Learning from Narrated Videos [slide]
Jean-Baptiste Alayrac, DeepMind

In this talk, I will emphasize the importance of being able to learn useful visual representations with less supervision, especially in the context of video understanding. I will argue that narrated instructional videos are a promising source of data for that purpose as they (i) typically depict highly structured human tasks, (ii) are multimodal (video and language) and (iii) are readily available at scale on YouTube. I will then describe two of our recent results that take advantage of this observation. First, we investigate learning visual models for the steps of ordinary tasks using an ordered list of textual steps as our only source of supervision. We show that when we allow our visual models to be compositional, they can share knowledge between different tasks and in turn obtain better localization performance. Second, I will introduce our recent dataset, HowTo100M. HowTo100M depicts more than 20K human tasks, contains more than 100 million narrated video clips (where the narration is obtained with Automatic Speech Recognition) and was collected without any manual annotation. Finally, we demonstrate that a text-video embedding trained on HowTo100M leads to state-of-the-art results for text-to-video retrieval and action localization tasks.

Invited Talk 3: Detecting Activities with Less [slide]
Cees Snoek, University of Amsterdam

Spatio-temporal detection of human activities in video is demanding in terms of labels and computation. In this talk, I will present recent work that attacks these problems. First, I will discuss supervision with less. While normally used exclusively for inference, we show unsupervised spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and proposals into a new objective for Multiple Instance Learning. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of proposals. Then I will zoom in on representation learning with less. The two-stream detection network based on RGB and flow provides state-of-the-art accuracy at the expense of a large model size and heavy computation. We propose to embed RGB and optical flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB features. The method is easily embedded in existing appearance- or two-stream action detection networks, and trained end-to-end. Experiments on several video datasets demonstrate the ability to detect human activities with fewer labels and less computation, while maintaining competitive accuracy.

Invited Talk 4: From video-level to fine-grained recognition and retrieval of interactions [slide]
Dima Damen, University of Bristol

This talk argues for a fine-grained perspective on human-object interactions in video sequences. I will present approaches for dual-learning [CVPR 2019, ICCVW 2019] as well as multi-modal approaches using vision, audio and language [ICCV 2019, BMVC 2019]. These works use fine-grained datasets, including our own EPIC-KITCHENS [ECCV 2018], the recently released largest dataset of object interactions in people’s homes, recorded using wearable cameras. More details at http://dimadamen.github.io.

Call for Participation

As last year, we are soliciting participation in two tracks:

Classification Challenge Track

This track will be organized as a Kaggle competition for large-scale video classification based on the YouTube-8M dataset. Researchers are invited to participate in the classification challenge by training a model on the public YouTube-8M training and validation sets and submitting video classification results on a blind test set.

This year, we have updated the dataset to include segment-level human-labeled ground truth for a subset of videos in the dataset. The granularity of the labeling is therefore increased from one label per video to one per 5 seconds. Each video will again come with time-localized frame-level features so classifier predictions can be made at segment-level granularity. Unlike the previous editions of this challenge, the competition task will focus on temporal localization within a video. Segment/frame-level annotation or temporal localization is an important challenge in video understanding with various applications, such as searching within a video or discovering interesting action moments. In practice, segment-level annotation data is very hard and expensive to collect at large scale, making this problem very difficult. Thus, the main focus of this year's challenge is how to jointly leverage noisy video-level labels and a small segment-level calibration set in order to better annotate and temporally localize concepts of interest. We will evaluate submissions based on human-labeled data for the first time. There is no model size restriction this year, although we encourage participants to train a lighter single model instead of heavy ensembles. Open-source TensorFlow code, implementing a few baseline segment-level classification models for YouTube-8M, along with training and evaluation scripts, is available at GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle. Results will be scored by a Kaggle evaluation server and published on a public leaderboard, updated live for all submissions (scored on a portion of the test set), along with a final (private) leaderboard, published after the competition is closed (scored on the rest of the test set). Top-ranking submissions in the challenge leaderboard will be invited to the workshop to present their method.
Please see details on the Kaggle competition page.
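
As a rough illustration of the segment-level granularity described above, the following Python sketch groups per-frame feature vectors into 5-second segments and mean-pools each one. It is not taken from the official starter code: the function name `split_into_segments`, the rate of one feature vector per second, the mean-pooling step, and the toy 2-dimensional features are all assumptions made for this example.

```python
from typing import List, Sequence

SEGMENT_SECONDS = 5      # segment length stated in the challenge description
FRAMES_PER_SECOND = 1    # assumption: one feature vector per second of video


def split_into_segments(frames: Sequence[Sequence[float]],
                        segment_seconds: int = SEGMENT_SECONDS,
                        fps: int = FRAMES_PER_SECOND) -> List[dict]:
    """Group per-frame feature vectors into fixed-length, mean-pooled segments."""
    frames_per_segment = segment_seconds * fps
    segments = []
    # A trailing partial segment is dropped so every segment spans the full window.
    for start in range(0, len(frames) - frames_per_segment + 1, frames_per_segment):
        window = frames[start:start + frames_per_segment]
        dim = len(window[0])
        # Mean-pool each feature dimension over the frames in this window.
        pooled = [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
        segments.append({"start_time": start // fps, "feature": pooled})
    return segments


# Toy video (hypothetical data): 12 frames of 2-dimensional features.
video = [[float(t), float(2 * t)] for t in range(12)]
for seg in split_into_segments(video):
    print(seg["start_time"], seg["feature"])
```

A segment-level classifier could then score each pooled vector independently. Mean pooling is only one simple choice here; the temporal and attention-based pooling approaches listed among the topics below are natural alternatives.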

We encourage participants to explore the following topics (non-exhaustive list) and to submit papers to this workshop discussing their approaches and result analysis (publication is also a requirement for prize eligibility on the Kaggle competition):

large-scale multi-label video classification / annotation
temporal / sequence modeling and pooling approaches for video
temporal attention modeling mechanisms
video representation learning (e.g., classification performance vs. video descriptor size)
multi-modal (audio-visual) modeling and fusion approaches
learning from noisy / incomplete ground-truth labels
score calibration and ranking across classes and videos
multiple-instance learning
transfer learning, domain adaptation, generalization (across the 24 top-level verticals)
scale: performance vs. training data & compute quantity (#videos, #frames, #CPUs, etc.)

General Paper Track

Researchers are invited to submit any papers involving research, experimentation, or applications on the YouTube-8M dataset. For this track, the paper need not tackle this year's task (segment-level video annotation). We welcome submissions addressing other tasks on the dataset, including our previous challenge topic (video-level annotation). Paper submissions will be reviewed by the workshop organizers and accepted papers will be invited for oral or poster presentations at the workshop.

We encourage participants to explore any relevant topics of interest using the YouTube-8M dataset, including but not limited to:

All of the topics listed above (with or without participation in the Kaggle challenge)
Large-scale video recommendations, search, and discovery
Joining YouTube-8M with other publicly available resources / datasets (e.g., exploring YouTube video metadata, testing public models / features on this dataset, etc.)
Dataset visualization / browsing interfaces
Label augmentation and cleanup, active learning
Open-source contributions to the community targeting this dataset

Submission to this track does not require participation in the challenge task, but must be related to the YouTube-8M dataset. We welcome new applications that we didn't think of! Paper submissions are expected to have 8 to 12 pages (no strict page limit) in the ICCV formatting style. Demo paper submissions are also welcome.

This year, the submission system does not distinguish between the two tracks. If you are submitting to the general paper track, please indicate "N/A" in the Kaggle team name section in the submission questionnaire. For submissions to the classification challenge track, this field is required.

Awards

Each of the top 10 ranked teams (on the final private leaderboard) will receive $2,500 per team as a travel award to attend the ICCV 2019 Conference. Prize eligibility requires adherence to the Competition Rules. Winners must submit and present a paper describing their approach to the workshop to be eligible for this award.

Submission Instructions

Formatting

All submissions will be handled electronically, through our CMT submission site. Papers are limited to 8 pages, including figures and tables, in the ICCV style. Additional pages containing only cited references are allowed. Please refer to the files in the Author Guidelines page at the ICCV 2019 website for formatting instructions.

Review Process

Submitted papers will be reviewed by the organizing committee members, and a subset will be selected for oral or poster presentation. Submissions will be evaluated in terms of potential impact (e.g. performance on the classification challenge), technical depth & scalability, novelty, and presentation.

Blind Submission / Dual Submission / Page Limit Policies

We do not require blind submissions: author names and affiliations may be shown. We do not restrict submissions of relevant work that is under review or will be published elsewhere. Previously published work (except at previous YouTube-8M workshops) is also acceptable as long as it is retargeted towards YouTube-8M. Papers are limited to 8 pages, including figures and tables, but excluding references. The accepted papers will be linked on the workshop website and will appear in the ICCV proceedings through the CVF open access archive.

How to Submit

Create an account at our CMT submission site. If you do not receive a confirmation email, you may reset your password before logging in.
Submit your paper through the CMT site. You will have to input your Kaggle team name if you are submitting your approach used in the Kaggle competition.
Submission deadline for this form is September 20, 2019, 11:59 PM (UTC/GMT).

Important Dates

Following the deadline extension of the challenge, we are running two rounds of paper submission. The first round is the same as before; for those who want to confirm acceptance before scheduling the trip to ICCV, we encourage you to submit a paper based on your intermediate result by 9/20. We will notify authors of the result by 9/24. Paper submission is open until 10/18, one week after the competition closing date. The top 10 teams must submit a paper by this due date to be eligible for the prize. We strongly encourage the winners to present in person, but if that is difficult (for instance, due to visa requirements), a remote presentation can be arranged, either live or through a video recording. The camera-ready deadline for all accepted papers is 10/25, through the CMT submission site. Optionally, authors of papers accepted in the 1st round may choose to officially publish the paper by submitting the camera-ready version by 9/27. We will send detailed instructions later.

Paper submission deadline (1st round)	September 20, 2019 (11:59 PM UTC/GMT)
Paper Acceptance Notification (1st round)	September 24, 2019
Paper camera-ready deadline (1st round)	September 27, 2019
Challenge submission deadline	October 11, 2019
Paper submission deadline (2nd round) & Winners' obligations deadline	October 18, 2019 (11:59 PM UTC/GMT)
Paper Acceptance Notification (2nd round) & Challenge Winners Confirmation	October 22, 2019
Paper camera-ready deadline (2nd round)	October 25, 2019
Workshop date (at ICCV'19)	October 28, 2019

Organizers

General Chairs

Apostol (Paul) Natsev

Cordelia Schmid

Rahul Sukthankar

Program Chairs

Joonseok Lee

George Toderici

Challenge Organizers

Ke Chen	Julia Elliott	Nisarg Kothari	Hanhan Li
Joe Yue-Hei Ng	Sobhan Naderi Parizi	Walter Reade	David Ross
Javier Snaider	Balakrishnan Varadarajan	Sudheendra Vijayanarasimhan	Yexin Wang
Zheng Xu	Wing Cheuk