Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
Venue
ACM Multimedia (2015)
Publication Year
2015
Authors
Chen Sun, Sanketh Shetty, Rahul Sukthankar, Ram Nevatia
Abstract
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify temporal
segments corresponding to the actions, and learn models that generalize to
unconstrained web videos. We find that web images queried by action names serve as
well-localized highlights for many actions, but are noisily labeled. To solve this
problem, we propose a simple yet effective method that takes weak video labels and
noisy image labels as input, and generates localized action frames as output. This
is achieved by cross-domain transfer between video frames and web images, using
pre-trained deep convolutional neural networks. We then use the localized action
frames to train action recognition models with long short-term memory networks. We
collect a fine-grained sports action data set, FGA-240, of more than 130,000
YouTube videos covering 240 fine-grained actions under 85 sports activities. We
show convincing results on the FGA-240 data set, as well as on the THUMOS 2014
localization data set with untrimmed training videos.
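
The abstract describes a two-stage pipeline: frames of untrimmed videos are scored
by a classifier trained on (noisily labeled) web images, confidently scored frames
are kept as localized action highlights, and a long short-term memory network is
then trained on those frames. The Python sketch below is a minimal illustration of
that idea under stated assumptions, not the authors' implementation: the ResNet-18
backbone, the 0.5 confidence threshold, and all class and function names
(FrameScorer, localize_frames, ActionLSTM) are hypothetical choices for
illustration only.

    # Minimal sketch of the two-stage idea: (1) score video frames with a
    # classifier whose head is trained on noisy web images, keeping only
    # high-confidence frames as localized action highlights, then
    # (2) model the kept frames temporally with an LSTM.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class FrameScorer(nn.Module):
        """Pre-trained CNN backbone plus a linear head for action scores."""
        def __init__(self, num_actions: int):
            super().__init__()
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            backbone.fc = nn.Identity()      # expose 512-d frame features
            self.backbone = backbone
            self.head = nn.Linear(512, num_actions)

        def forward(self, frames):           # frames: (N, 3, 224, 224)
            feats = self.backbone(frames)    # (N, 512) frame features
            return feats, self.head(feats)   # features and per-action scores

    def localize_frames(scores, video_label, threshold=0.5):
        """Keep frame indices whose score for the weak video-level label
        is confident; these act as localized action highlights."""
        probs = scores.softmax(dim=1)[:, video_label]
        return (probs > threshold).nonzero(as_tuple=True)[0]

    class ActionLSTM(nn.Module):
        """LSTM over the localized frame features for action recognition."""
        def __init__(self, feat_dim=512, hidden=256, num_actions=240):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_actions)

        def forward(self, feat_seq):         # feat_seq: (B, T, feat_dim)
            out, _ = self.lstm(feat_seq)
            return self.classifier(out[:, -1])  # predict from final state

Note that the one-directional filter above is a simplification: the abstract
describes cross-domain transfer between video frames and web images, in which each
domain helps clean the labels of the other.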
