Apostol (Paul) Natsev
Apostol (Paul) Natsev is a software engineer and manager in the video content analysis group at Google Research. Previously, he was a research staff member and manager of the multimedia research group at IBM Research from 2001 to 2011. He received a master's degree and a Ph.D. in computer science from Duke University, Durham, NC, in 1997 and 2001, respectively. Dr. Natsev's research interests span the areas of image and video analysis and retrieval, machine perception, large-scale machine learning and recommendation systems. He is an author of more than 80 publications and his research has been recognized with several awards.
Large Scale Video Representation Learning via Relational Graph Clustering
Hyodong Lee
Joe Yue-Hei Ng
Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Representation learning is widely applied to various tasks on multimedia data, e.g., retrieval and search. One approach to learning useful representations is to utilize the relationships or similarities between examples. In this work, we explore two promising scalable representation learning approaches in the video domain. With hierarchical graph clusters built upon video-to-video similarities, we propose: 1) a smart negative sampling strategy that significantly boosts training efficiency with triplet loss, and 2) a pseudo-classification approach using the clusters as pseudo-labels. The embeddings trained with the proposed methods are competitive on multiple video understanding tasks, including related video retrieval and video annotation. Both of these proposed methods are highly scalable, as verified by experiments on large-scale datasets.
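As a rough illustration of the first idea in this abstract, the sketch below pairs a standard triplet loss with negatives drawn from a cluster hierarchy. It is not the paper's implementation: the sibling-cluster sampling rule, the function names (`triplet_loss`, `sample_negative`), and the two-level `cluster_of` / `parent_of` mappings are all assumptions made for the example.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


def sample_negative(anchor_id, cluster_of, parent_of, rng):
    """Pick a negative for `anchor_id` using a two-level cluster hierarchy.

    Assumption (not from the paper): a video in a sibling cluster (same parent,
    different leaf cluster) is treated as an informative negative. If no sibling
    exists, fall back to any video outside the anchor's leaf cluster.
    """
    leaf = cluster_of[anchor_id]
    parent = parent_of[leaf]
    candidates = [
        vid for vid, c in cluster_of.items()
        if c != leaf and parent_of[c] == parent
    ]
    if not candidates:
        candidates = [vid for vid, c in cluster_of.items() if c != leaf]
    return rng.choice(candidates)


if __name__ == "__main__":
    # Hypothetical toy data: four videos, three leaf clusters, two parent clusters.
    embeddings = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [3.0, 4.0], "d": [9.0, 9.0]}
    cluster_of = {"a": "c1", "b": "c1", "c": "c2", "d": "c3"}
    parent_of = {"c1": "p1", "c2": "p1", "c3": "p2"}

    rng = random.Random(0)
    neg = sample_negative("a", cluster_of, parent_of, rng)
    loss = triplet_loss(embeddings["a"], embeddings["b"], embeddings[neg])
    print(neg, loss)
```

In this toy setup the only sibling-cluster candidate for video `a` is `c`, so the sampler skips the distant video `d`, which sits under a different parent cluster.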
Large-Scale Training Framework for Video Annotation
Seong Jae Hwang
Balakrishnan Varadarajan
Ariel Gordon
Proc. of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM (2019)
Video is one of the richest sources of information available online but extracting deep insights from video content at internet scale is still an open problem, both in terms of depth and breadth of understanding, as well as scale. Over the last few years, the field of video understanding has made great strides due to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, the state-of-the-art architectures on small scale datasets are frequently impractical to deploy at internet scale, both in terms of the ability to train such deep networks on hundreds of millions of videos, and to deploy them for inference on billions of videos. In this paper, we present a MapReduce-based training framework, which exploits both data parallelism and model parallelism to scale training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures, which enables a trade-off between model depth and breadth, and the ability to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework is able to reach state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to 100 times larger datasets.
Collaborative Deep Metric Learning for Video Understanding
Balakrishnan Varadarajan
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM (2018)
The goal of video understanding is to develop algorithms that enable machines to understand videos at the level of human experts. Researchers have tackled various domains including video classification, search, personalized recommendation, and more. However, there is a research gap in combining these domains in one unified learning framework. Towards that, we propose a deep network that embeds videos, using their audio-visual content, onto a metric space which preserves video-to-video relationships. Then, we use the trained embedding network to tackle various domains including video classification and recommendation, showing significant improvements over state-of-the-art baselines. The proposed approach is highly scalable to deploy on large-scale video sharing platforms like YouTube.
The Kinetics Human Action Video Dataset
Andrew Zisserman
Joao Carreira
Karen Simonyan
Will Kay
Brian Zhang
Chloe Hillier
Fabio Viola
Tim Green
Trevor Back
Mustafa Suleyman
arXiv (2017)
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset.
Content-based Related Video Recommendations
Nisarg Kothari
Advances in Neural Information Processing Systems (NIPS) Demonstration Track (2016)
YouTube-8M: A Large-Scale Video Classification Benchmark
Nisarg Kothari
Balakrishnan Varadarajan
arXiv:1609.08675 (2016)
Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets.
In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos---500K hours of video---annotated with a vocabulary of 4803 visual entities. To get the videos and their (multiple) labels, we used the YouTube Data APIs. We filtered the video labels (Freebase topics) using both automated and manual curation strategies, including by asking Mechanical Turk workers if the labels are visually recognizable. Then, we decoded each video at one-frame-per-second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and make both the features and video-level labels available for download. The dataset contains frame-level features for over 1.9 billion video frames and 8 million videos, making it the largest public multi-label video dataset.
We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using the publicly-available TensorFlow framework. We plan to release code for training a basic TensorFlow model and for computing metrics.
We show that pre-training on large data generalizes to other datasets like Sports-1M and ActivityNet. We achieve state-of-the-art on ActivityNet, improving mAP from 53.8% to 77.8%. We hope that the unprecedented scale and diversity of YouTube-8M will lead to advances in video understanding and representation learning.
Efficient Large Scale Video Classification
Balakrishnan Varadarajan
dblp computer science bibliography, http://dblp.org (2015) (to appear)
Video classification has advanced tremendously over recent years. A large part of the improvement in video classification had to do with the work done by the image classification community and the use of deep convolutional networks (CNNs), which produce results competitive with hand-crafted motion features. These networks were adapted to use video frames in various ways and have yielded state-of-the-art classification results. We present two methods that build on this work, and scale it up to work with millions of videos and hundreds of thousands of classes while maintaining a low computational cost. In the context of large-scale video processing, training CNNs on video frames is extremely time-consuming, due to the large number of frames involved. We propose to avoid this problem by training CNNs on either YouTube thumbnails or Flickr images, and then using these networks' outputs as features for other higher-level classifiers. We discuss the challenges of achieving this and propose two models for frame-level and video-level classification. The first is a highly efficient mixture of experts, while the second is based on long short-term memory neural networks. We present results on the Sports-1M video dataset (1 million videos, 487 classes) and on a new dataset which has 12 million videos and 150,000 labels.
Tracking Large-Scale Video Remix in Real-World Events
Lexing Xie
Xuming He
John R. Kender
Matthew L. Hill
John R. Smith
IEEE Transactions on Multimedia, vol. 15, no. 6 (2013), pp. 1244-1254
Content sharing networks, such as YouTube, contain traces of both explicit online interactions (such as likes, comments, or subscriptions), as well as latent interactions (such as quoting, or remixing, parts of a video). We propose visual memes, or frequently re-posted short video segments, for detecting and monitoring such latent video interactions at scale. Visual memes are extracted by scalable detection algorithms that we develop, with high accuracy. We further augment visual memes with text, via a statistical model of latent topics. We model content interactions on YouTube with visual memes, defining several measures of influence and building predictive models for meme popularity. Experiments are carried out with over 2 million video shots from more than 40,000 videos on two prominent news events in 2009: the election in Iran and the swine flu epidemic. In these two events, a high percentage of videos contain remixed content, and it is apparent that traditional news media and citizen journalists have different roles in disseminating remixed content. We perform two quantitative evaluations for annotating visual memes and predicting their popularity. The proposed joint statistical model of visual memes and words outperforms an alternative concurrence model, with an average error of 2% for predicting meme volume and 17% for predicting meme lifespan.
Multimedia Semantics: Interactions Between Content and Community
Hari Sundaram
Lexing Xie
Munmun De Choudhury
Yu-Ru Lin
Proceedings of the IEEE, vol. 100, no. 9 (2012)
This paper reviews the state of the art and some emerging issues in research areas related to pattern analysis and monitoring of web-based social communities. This research area is important for several reasons. First, the presence of near-ubiquitous low-cost computing and communication technologies has enabled people to access and share information at an unprecedented scale. The scale of the data necessitates new research for making sense of such content. Furthermore, popular websites with sophisticated media sharing and notification features allow users to stay in touch with friends and loved ones; these sites also help to form explicit and implicit social groups. These social groups are an important source of information to organize and to manage multimedia data. In this article, we study how media-rich social networks provide additional insight into familiar multimedia research problems, including tagging and video ranking. In particular, we advance the idea that the contextual and social aspects of media are as important for successful multimedia applications as is the media content. We examine the interrelationship between content and social context through the prism of three key questions. First, how do we extract the context in which social interactions occur? Second, does social interaction provide value to the media object? Finally, how do social media facilitate the repurposing of shared content and engender cultural memes? We present three case studies to examine these questions in detail. In the first case study, we show how to discover structure latent in the social media data, and use the discovered structure to organize Flickr photo streams. In the second case study, we discuss how to determine the interestingness of conversations---and of participants---around videos uploaded to YouTube. Finally, we show how the analysis of visual content, in particular tracing of content remixes, can help us understand the relationship among YouTube participants. 
For each case, we present an overview of recent work and review the state of the art. We also discuss two emerging issues related to the analysis of social networks---robust data sampling and scalable data analysis.
Scene Aligned Pooling for Complex Video Recognition
Liangliang Cao
Yadong Mu
Shih-Fu Chang
Gang Hua
John R. Smith
ECCV (2012), pp. 688-701
Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose any video features into concurrent scene components, and to construct classification models adaptive to different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 dataset and the Human Motion Recognition Database (HMDB), show that our new visual representation consistently improves various kinds of visual features by a significant margin, including low-level color and texture features, middle-level histograms of local descriptors such as SIFT or space-time interest points, and high-level semantic model features. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.
Video Event Detection Using Temporal Pyramids of Visual Semantics with Kernel Optimization and Model Subspace Boosting
Noel C. F. Codella
Gang Hua
Matthew L. Hill
Liangliang Cao
Leiguang Gong
John R. Smith
ICME (2012), pp. 747-752
Social Media Use by Government: From the Routine to the Critical
Andrea Kavanaugh
Edward A. Fox
Stephen Sheetz
Seungwon Yang
Lin Tzy Li
Donald Shoemaker
Lexing Xie
Government Information Quarterly, vol. 29, no. 4 (2012), pp. 480-491
Semantic Model Vectors for Complex Video Event Recognition
Michele Merler
Bert Huang
Lexing Xie
Gang Hua
IEEE Transactions on Multimedia, vol. 14 (2012), pp. 88-101
Visual memes in social media: tracking real-world news in YouTube videos
Lexing Xie
John R. Kender
Matthew L. Hill
John R. Smith
ACM Multimedia (2011), pp. 53-62
Towards large scale land-cover recognition of satellite images
Noel C.F. Codella
Gang Hua
John R. Smith
Intl. Conference on Information, Communications and Signal Processing (ICICS) (2011), pp. 1-5
Image modality classification: a late fusion method based on confidence indicator and closeness matrix
Tracking Visual Memes in Rich-Media Social Communities
Social media use by government: from the routine to the critical
Andrea Kavanaugh
Edward A. Fox
Stephen Sheetz
Seungwon Yang
Lin Tzy Li
Travis Whalen
Donald Shoemaker
Lexing Xie
ACM Digital Government Conference, College Park, MD, USA (2011)
IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) System
Liangliang Cao
Shih-Fu Chang
Noel C. F. Codella
Courtenay Cotton
Dan Ellis
Leiguang Gong
Matthew Hill
Gang Hua
John R. Kender
Michele Merler
Yadong Mu
John R. Smith
NIST TRECVID Workshop (2011)
Probabilistic visual concept trees
Multimedia semantics: opportunities and challenges
Multimedia Information Retrieval (2010), pp. 9-10
Design and evaluation of an effective and efficient video copy detection system
IBM Research TRECVID-2010 Video Copy Detection and Multimedia Event Detection System
Matthew L. Hill
Gang Hua
John R. Smith
Lexing Xie
Bert Huang
Michele Merler
Hua Ouyang
Mingyuan Zhou
TRECVID (2010)
The accuracy and value of machine-generated image tags: design and user evaluation of an end-to-end image tagging system
Video genetics: a case study from YouTube
John R. Kender
Matthew L. Hill
John R. Smith
Lexing Xie
ACM Multimedia (2010), pp. 1253-1258
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Rong Yan
Marc-Olivier Fleury
Michele Merler
John R. Smith
ACM workshop on Large-scale multimedia retrieval and mining (LS-MMRM) (2009), pp. 35-42
Evaluating application mapping scenarios on the Cell/B.E
Ana Lucia Varbanescu
Henk J. Sips
Kenneth A. Ross
Qiang Liu
John R. Smith
Lurng-Kuo Liu
Concurrency and Computation: Practice and Experience, vol. 21 (2009), pp. 85-100
Formal Models and Hybrid Approaches for Efficient Manual Image Annotation and Retrieval
Rong Yan
Murray Campbell
Semantic Mining Technologies for Multimedia Databases (2009), pp. 272-297
Hybrid Tagging and Browsing Approaches for Efficient Manual Image Annotation
IBM Research TRECVID-2009 Video Retrieval System
Shenghua Bao
Jane Chang
Matthew Hill
Michele Merler
John R. Smith
Dong Wang
Lexing Xie
Rong Yan
Yi Zhang
NIST TRECVID Workshop (2009)
A learning-based hybrid tagging and browsing approach for efficient manual image annotation
Multi-query interactive image and video retrieval: theory and practice
Query-Adaptive Fusion for Multimodal Search
Lyndon S. Kennedy
Shih-Fu Chang
Proceedings of the IEEE, vol. 96, no. 4 (2008)
IBM Research TRECVID-2008 Video Retrieval System
John R. Smith
Jelena Tesic
Lexing Xie
Rong Yan
Wei Jiang
Michele Merler
TRECVID (2008)
IBM multimedia analysis and retrieval system
Web-based information content and its application to concept-based video retrieval
Data Modeling Strategies for Imbalanced Learning in Visual Search
Dynamic Multimodal Fusion in Video Search
Semantics reinforcement and fusion learning for multimedia streams
Digital Media Indexing on the Cell Processor
Lurng-Kuo Liu
Qiang Liu
Kenneth A. Ross
John R. Smith
Ana Lucia Varbanescu
ICME (2007), pp. 1866-1869
An Effective Strategy for Porting C++ Applications on Cell
Ana Lucia Varbanescu
Henk J. Sips
Kenneth A. Ross
Qiang Liu
Lurng-Kuo Liu
John R. Smith
ICPP (2007), pp. 59
An efficient manual image annotation approach based on tagging and browsing
Rong Yan
Murray Campbell
ACM Multimedia Workshop on the many faces of multimedia semantics (2007), pp. 13-20
IBM multimodal interactive video threading (demo)
IBM Research TRECVID-2007 Video Retrieval System
Murray Campbell
Alexander Haubold
Ming Liu
John R. Smith
Jelena Tesic
Lexing Xie
Rong Yan
Jun Yang
TRECVID (2007)
IBM multimedia search and retrieval system (demo)
Cluster-based data modeling for semantic video search
A Greedy Performance Driven Algorithm for Decision Fusion Learning
Semantic concept-based query expansion and re-ranking for multimedia retrieval
Alexander Haubold
Jelena Tesic
Lexing Xie
Rong Yan
ACM Multimedia (2007), pp. 991-1000
IBM research TRECVID-2006 video retrieval system
Murray Campbell
Alexander Haubold
Shahram Ebadollahi
Milind R. Naphade
John R. Smith
Jelena Tesic
Lexing Xie
NIST TRECVID Workshop (2006)
Assessing the Filtering and Browsing Utility of Automatic Semantic Concepts for Multimedia Retrieval
Michael G. Christel
Milind R. Naphade
Jelena Tesic
CVPR'06 Workshop on Semantic Learning Applications in Multimedia (SLAM) (2006), pp. 117
Exploring Automatic Query Refinement for Text-Based Video Retrieval
Multimodal Search for Effective Video Retrieval (demo)
CIVR (2006), pp. 525-528
Semantic Multimedia Retrieval using Lexical Query Expansion and Model-Based Reranking
IBM research TRECVID-2005 video retrieval system
Arnon Amir
J. Argillander
Murray Campbell
Alexander Haubold
Giri Iyengar
Shahram Ebadollahi
F. Kang
Milind R. Naphade
John R. Smith
Jelena Tesic
Timo Volkmer
NIST TRECVID Workshop (2005)
Learning and classification of semantic concepts in broadcast video
John R. Smith
Murray Campbell
Milind R. Naphade
Jelena Tesic
International Conference of Intelligence Analysis (2005)
Automatic discovery of query-class-dependent models for multimodal search
A web-based system for collaborative annotation of large image and video collections: an evaluation and user study
Multimedia Research Challenges for Industry
Learning the semantics of multimedia queries and concepts from a small number of examples
Multi-granular detection of regional semantic concepts
Over-complete representation and fusion for semantic concept detection
Semantic representation: search and mining of multimedia content
Content transcoding middleware for pervasive geospatial intelligence access
Ching-Yung Lin
Belle L. Tseng
Matthew Hill
John R. Smith
Chung-Sheng Li
ICME (2004), pp. 2139-2142
WALRUS: A Similarity Retrieval Algorithm for Image Databases
Rajeev Rastogi
Kyuseok Shim
IEEE Trans. Knowl. Data Eng., vol. 16 (2004), pp. 301-316
Validity-weighted model vector-based retrieval of video
John R. Smith
Ching-Yung Lin
Milind R. Naphade
Belle L. Tseng
Storage and Retrieval Methods and Applications for Multimedia (2004), pp. 271-279
Multimodal video search techniques: late fusion of speech-based retrieval and visual content-based retrieval
Arnon Amir
Giri Iyengar
Ching-Yung Lin
Milind R. Naphade
Chalapathy Neti
Harriet J. Nock
John R. Smith
Belle Tseng
Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) (2004), pp. 1048-1051
Multisource Video Clustering Using Semantic Model Vectors
John R. Smith
Ching-Yung Lin
Milind R. Naphade
Multimedia Information Retrieval, AIDA Informazioni (2004)
IBM research TRECVID-2004 video retrieval system
Arnon Amir
J. Argillander
M. Berg
Shih-Fu Chang
M. Franz
Winston Hsu
Giri Iyengar
John R. Kender
Lyndon S. Kennedy
Ching-Yung Lin
Milind R. Naphade
John R. Smith
Jelena Tesic
Gang Wu
Rong Yan
Donqing Zhang
NIST TRECVID Workshop (2004)
Lexicon design for semantic indexing in media databases
Milind R. Naphade
John R. Smith
International Conference on Communication Technologies and Programming (2003)
Interactive search fusion methods for video database retrieval
John R. Smith
Alejandro Jaimes
Ching-Yung Lin
Milind R. Naphade
Belle L. Tseng
ICIP (1) (2003), pp. 741-744
MPEG-7 video automatic labeling system (demo)
Ching-Yung Lin
Belle L. Tseng
Milind R. Naphade
John R. Smith
ACM Multimedia (2003), pp. 98-99
IBM Research TRECVID-2003 Video Retrieval System
Arnon Amir
Marco Berg
Shih-Fu Chang
Winston Hsu
Giridharan Iyengar
Ching-Yung Lin
Milind R. Naphade
Chalapathy Neti
Harriet Nock
John R. Smith
Belle L. Tseng
Yi Wu
Donqing Zhang
NIST TRECVID Workshop (2003)
Multimedia semantic indexing using model vectors
Normalized classifier fusion for semantic visual concept detection
Belle L. Tseng
Ching-Yung Lin
Milind R. Naphade
John R. Smith
ICIP (2) (2003), pp. 535-538
A framework for moderate vocabulary semantic visual concept detection
Milind R. Naphade
Ching-Yung Lin
Belle L. Tseng
John R. Smith
ICME (2003), pp. 437-440
Statistical Techniques for Video Analysis and Searching
John R. Smith
Ching-Yung Lin
Milind R. Naphade
Belle L. Tseng
Video Mining, Kluwer Academic Publishers (2003)
New anchor selection methods for image retrieval
Exploring semantic dependencies for scalable concept detection
Active selection for multi-example querying by content
VideoAL: a novel end-to-end MPEG-7 video automatic labeling system
Ching-Yung Lin
Belle L. Tseng
Milind R. Naphade
John R. Smith
ICIP (3) (2003), pp. 53-56
User-trainable video annotation using multimodal cues
Ching-Yung Lin
Milind R. Naphade
Chalapathy Neti
John R. Smith
Belle L. Tseng
Harriet J. Nock
W. Adams
SIGIR (2003), pp. 403-404
Aggregate Predicate Support in DBMS
Gene Y. C. Fuh
Weidong Chen
Chi-Huang Chiu
Jeffrey Scott Vitter
Australasian Database Conference (2002)
A study of image retrieval by anchoring
IBM Research TREC 2002 Video Retrieval System
Bill Adams
Giridharan Iyengar
Chalapathy Neti
Harriet J. Nock
Arnon Amir
Haim H. Permuter
Savitha Srinivasan
Chitra Dorai
Alejandro Jaimes
Christian A. Lang
Ching-Yung Lin
Milind R. Naphade
John R. Smith
Belle L. Tseng
Sugata Ghosal
Raghavendra Singh
T. V. Ashwin
DongQing Zhang
TREC (2002)
Spatial and feature normalization for content-based retrieval
CAMEL: concept annotated image libraries
Atul Chadha
Basuki Soetarman
Jeffrey Scott Vitter
Storage and Retrieval for Media Databases (2001), pp. 62-73
Supporting Incremental Join Queries on Ranked Inputs
Yuan-Chi Chang
John R. Smith
Chung-Sheng Li
Jeffrey Scott Vitter
VLDB (2001), pp. 281-290
Constrained querying of multimedia databases: issues and approaches
John R. Smith
Yuan-Chi Chang
Chung-Sheng Li
Jeffrey Scott Vitter
Storage and Retrieval for Media Databases (2001), pp. 74-85
Text compression via alphabet re-representation
WALRUS: A Similarity Retrieval Algorithm for Image Databases
Text Compression Via Alphabet Re-Representation