Large-scale Video Classification with Convolutional Neural Networks
Venue
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), IEEE
Publication Year
2014
Authors
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei
Abstract
Convolutional Neural Networks (CNNs) have been established as a powerful class of
models for image recognition problems. Encouraged by these results, we provide an
extensive empirical evaluation of CNNs on large-scale video classification using a
dataset of 1 million YouTube videos belonging to 487 classes. We study multiple
approaches for extending the connectivity of a CNN in the time domain to take advantage
of local spatio-temporal information and suggest a multi-resolution, foveated
architecture as a promising way of regularizing the learning problem and speeding
up training. Our best spatio-temporal networks display significant performance
improvements compared to strong feature-based baselines (55.3% to 63.9%), but only
a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%).
We further study the generalization performance of our best model by retraining the
top layers on the UCF-101 Action Recognition dataset and observe significant
performance improvements compared to the UCF-101 baseline model (63.3% up from
43.9%).
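
The multi-resolution, foveated idea mentioned in the abstract can be illustrated with a small preprocessing sketch. This is not the authors' implementation: it assumes that one stream receives the whole frame downsampled by a factor of two (context) and the other receives a center crop of the same size at full resolution (fovea). The function name `foveate` and the use of 2x2 average pooling for downsampling are illustrative choices only.

```python
import numpy as np

def foveate(frame):
    """Split one H x W x C frame into a context view and a fovea view.

    Assumption (not stated in the abstract): both views have half the
    side length of the input frame.
    """
    h, w = frame.shape[:2]
    ch, cw = h // 2, w // 2

    # Context view: the whole frame, downsampled 2x by 2x2 average pooling.
    trimmed = frame[: ch * 2, : cw * 2].astype(np.float32)
    context = trimmed.reshape(ch, 2, cw, 2, -1).mean(axis=(1, 3))

    # Fovea view: the center region, kept at the original resolution.
    top, left = (h - ch) // 2, (w - cw) // 2
    fovea = frame[top : top + ch, left : left + cw].astype(np.float32)

    return context, fovea
```

Each view holds a quarter of the pixels of the input frame, so the two streams together process half as many input values as a single full-resolution stream would, which is one plausible reading of how such a design could speed up training.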
