Publication Data
Video2Text: Learning to Annotate Video Content
Abstract: This paper discusses a new method for automatic discovery
and organization of descriptive concepts (labels) within large real-world corpora of
user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation
of existing labels, if any. While training, our method does not assume any explicit
manual annotation other than the weak labels already available in the form of video
title, description, and tags. Prior work related to such auto-annotation assumed that
a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified
a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes
audiovisual features of 25 million YouTube.com videos – nearly 150 years of video data
– effectively searching for consistent correlation between these features and text
metadata. It autonomously extends the label vocabulary as and when it discovers
concepts it can reliably identify, eventually leading to a vocabulary with thousands of
labels and growing. We believe that this work significantly extends the state of the art
in multimedia data mining, discovery, and organization based on the technical merit of
the proposed ideas as well as the enormous scale of the mining exercise in a very
challenging, unconstrained, noisy domain.
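
The vocabulary-growing loop summarized in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the nearest-centroid classifier, the held-out precision test, the function names (`discover_vocabulary`, `toy_corpus`), and the thresholds `min_support` and `min_precision` are all assumptions chosen for illustration. The sketch only shows the general idea of starting from an empty vocabulary and admitting a metadata word as a label once audiovisual features predict it reliably.

```python
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discover_vocabulary(videos, min_support=3, min_precision=0.8):
    """Grow a label vocabulary from weakly labeled videos.

    `videos` is a list of (feature_tuple, set_of_metadata_words) pairs.
    Hypothetical procedure: a word is admitted only if a nearest-centroid
    model trained on its weak labels reaches `min_precision` on held-out videos.
    """
    # Every third video is held out for validation (an arbitrary choice).
    train = [v for i, v in enumerate(videos) if i % 3 != 0]
    held_out = [v for i, v in enumerate(videos) if i % 3 == 0]

    # Candidate labels are simply the words seen in the training metadata.
    support = Counter(w for _, words in train for w in words)

    vocabulary = {}  # starts empty, as in the abstract
    for word, count in sorted(support.items()):
        if count < min_support:
            continue
        pos = [f for f, words in train if word in words]
        neg = [f for f, words in train if word not in words]
        if not neg:
            continue
        pos_c, neg_c = centroid(pos), centroid(neg)

        # Held-out videos whose features lie closer to the positive centroid.
        predicted = [words for f, words in held_out
                     if sq_dist(f, pos_c) < sq_dist(f, neg_c)]
        if not predicted:
            continue
        precision = sum(word in words for words in predicted) / len(predicted)
        if precision >= min_precision:
            vocabulary[word] = precision
    return vocabulary


def toy_corpus():
    """Synthetic data: 'cat'/'dog' track the features, 'funny' does not."""
    videos = []
    for k in range(16):
        is_cat = k % 2 == 0
        features = (1.0, 0.0) if is_cat else (0.0, 1.0)
        words = {"cat" if is_cat else "dog"}
        if (k // 2) % 2 == 0:
            words.add("funny")
        videos.append((features, words))
    return videos


if __name__ == "__main__":
    print(discover_vocabulary(toy_corpus()))
```

On this synthetic corpus, `cat` and `dog` are admitted because they correlate with the features, while `funny` is rejected because the features cannot predict it with sufficient held-out precision.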
