Weakly Supervised Learning of Object Segmentations from Web-Scale Video
Venue
ECCV'12 Proceedings of the 12th European Conference on Computer Vision - Volume Part I, Springer-Verlag, Berlin, Heidelberg (2012), pp. 198-208
Publication Year
2012
Authors
Glenn Hartmann, Matthias Grundmann, Judy Hoffman, David Tsai, Vivek Kwatra, Omid Madani, Sudheendra Vijayanarasimhan, Irfan Essa, James Rehg, Rahul Sukthankar
Abstract
We propose to learn pixel-level segmentations of objects from weakly labeled
(tagged) internet videos. Specifically, given a large collection of raw YouTube
content, along with potentially noisy tags, our goal is to automatically generate
spatio-temporal masks for each object, such as "dog", without employing any
pre-trained object detectors. We formulate this problem as learning weakly
supervised classifiers for a set of independent spatio-temporal segments. The
object seeds obtained using segment-level classifiers are further refined using
graph cuts to generate high-precision object masks. Our results, obtained by
training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes,
demonstrate automatic extraction of pixel-level object masks. Evaluated against a
ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm
that our proposed methods can learn good object masks just by watching YouTube.
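The core weak-supervision idea in the abstract — every segment inherits its video's noisy tag, a segment-level classifier is trained on those labels, and high-scoring segments become object seeds — can be illustrated with a minimal sketch. This is not the authors' code: the nearest-centroid classifier, feature representation, and seed-selection rule below are simplifying assumptions standing in for the paper's actual classifiers and graph-cut refinement.

```python
# Hypothetical sketch of weakly supervised segment labeling (not the paper's
# implementation): segments inherit the video-level tag, a nearest-centroid
# classifier is fit on these noisy labels, and segments scoring higher for the
# target tag than for any other class are kept as object "seeds".
from collections import defaultdict
import math

def train_centroids(videos):
    """videos: list of (tag, segments), where segments is a list of feature
    vectors; every segment inherits its video's (possibly noisy) tag."""
    sums, counts = {}, defaultdict(int)
    for tag, segments in videos:
        for feat in segments:
            if tag not in sums:
                sums[tag] = [0.0] * len(feat)
            sums[tag] = [s + f for s, f in zip(sums[tag], feat)]
            counts[tag] += 1
    return {tag: [s / counts[tag] for s in sums[tag]] for tag in sums}

def score(feat, centroid):
    # Negative Euclidean distance: closer to the class centroid = higher score.
    return -math.dist(feat, centroid)

def object_seeds(segments, tag, centroids, margin=0.0):
    """Indices of segments scoring higher for `tag` than for any other class;
    in the paper these seeds would then be refined with graph cuts."""
    seeds = []
    for i, feat in enumerate(segments):
        s_tag = score(feat, centroids[tag])
        s_other = max(score(feat, c) for t, c in centroids.items() if t != tag)
        if s_tag > s_other + margin:
            seeds.append(i)
    return seeds
```

For example, with toy 2-D features, `train_centroids([("dog", [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0]]), ("cat", [[8.0, 9.0], [9.0, 8.0]])])` yields per-class centroids even though the "dog" video contributes one background-like segment, and `object_seeds` then keeps only the segments that look more "dog" than "cat".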
