Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
Venue
Conference on Computer Vision and Pattern Recognition (2011)
Publication Year
2011
Authors
Quoc V. Le, Will Zou, Serena Yeung, Andrew Y. Ng
Abstract
Previous work on action recognition has focused on adapting hand-designed local
features, such as SIFT or HOG, from static images to the video domain. In this
paper, we propose using unsupervised feature learning as a way to learn features
directly from video data. More specifically, we present an extension of the
Independent Subspace Analysis algorithm to learn invariant spatio-temporal features
from unlabeled video data. We discovered that, despite its simplicity, this method
performs surprisingly well when combined with deep learning techniques such as
stacking and convolution to learn hierarchical representations. By replacing
hand-designed features with our learned features, we achieve classification results
superior to all previous published results on the Hollywood2, UCF, KTH and YouTube
action recognition datasets. On the challenging Hollywood2 and YouTube action
datasets, we obtain 53.3% and 75.8%, respectively, approximately 5% better
than the current best published results. Further benefits of this method, such as
the ease of training and the efficiency of training and prediction, will also be
discussed. You can download our code and learned spatio-temporal features here:
http://ai.stanford.edu/~wzou/
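
For readers who want a concrete picture of the ISA activation the abstract refers to, the following is a minimal NumPy sketch, not the authors' released implementation: in ISA, first-layer linear responses are squared and then pooled within fixed subspaces under a square root, giving responses invariant to the phase of the first-layer units. The function name, layer sizes, and toy data below are illustrative assumptions.

    import numpy as np

    def isa_activation(x, W, V):
        """Pooled ISA response for a flattened spatio-temporal patch x.

        W : (k, n) learned first-layer weights (rows kept orthonormal in ISA)
        V : (m, k) fixed binary pooling matrix grouping units into subspaces
        Returns m pooled responses.
        """
        squared = (W @ x) ** 2        # square the first-layer linear responses
        return np.sqrt(V @ squared)   # sum within each subspace, then square-root

    # Toy usage (assumed sizes): a 300-dimensional patch, 100 first-layer
    # units, and 50 subspaces of 2 units each.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(300)
    W = rng.standard_normal((100, 300))
    V = np.kron(np.eye(50), np.ones((1, 2)))  # 50 subspaces of 2 units each
    print(isa_activation(x, W, V).shape)      # (50,)

In the paper's hierarchical setting, such a unit would be applied convolutionally to video blocks and its outputs fed to a second, stacked ISA layer; this sketch shows only the basic activation.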
