Beyond Short Snippets: Deep Networks for Video Classification
Venue
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)
Publication Year
2015
Authors
Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici
BibTeX
@inproceedings{ng2015beyond,
  title     = {Beyond Short Snippets: Deep Networks for Video Classification},
  author    = {Ng, Joe Yue-Hei and Hausknecht, Matthew and Vijayanarasimhan, Sudheendra and Vinyals, Oriol and Monga, Rajat and Toderici, George},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2015}
}
Abstract
Convolutional neural networks (CNNs) have been extensively applied to image
recognition problems, giving state-of-the-art results on recognition, detection,
segmentation, and retrieval. In this work, we propose and evaluate several deep
neural network architectures that combine image information across a video over
longer time periods than previously attempted. We propose two methods capable of
handling full-length videos. The first method explores convolutional temporal
feature pooling architectures, examining the design choices that must be made
when adapting a CNN to this task. The second method explicitly models the video
as an ordered sequence of frames; for this purpose, we employ a recurrent neural
network with Long Short-Term Memory (LSTM) cells connected to the output of the
underlying CNN. Our best networks exhibit significant performance improvements
over previously published results on the Sports-1M dataset (73.1% vs. 60.9%) and
on the UCF-101 dataset both with (88.6% vs. 88.0%) and without (82.6% vs. 72.8%)
additional optical flow information.
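
For a concrete picture of the two approaches, the following minimal PyTorch sketch (our illustration, not the authors' code) shows (a) temporal max-pooling of per-frame CNN features and (b) an LSTM reading the frame features in order. The stand-in encoder, layer counts, and last-step readout are simplifying assumptions; the paper pools activations of a deep image CNN (AlexNet/GoogLeNet-style) at a convolutional layer and uses a deeper LSTM stack with per-frame predictions.

```python
import torch
import torch.nn as nn

class TemporalPoolingClassifier(nn.Module):
    """Method 1 (sketch): max-pool per-frame CNN features over time, then classify.
    The paper pools at a convolutional layer of a deep image CNN; here we pool the
    encoder's final feature vector for brevity."""
    def __init__(self, frame_encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = frame_encoder              # (B*T, C, H, W) -> (B*T, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = frames.shape              # frames: batch of T-frame videos
        feats = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        pooled, _ = feats.max(dim=1)              # temporal max-pooling over T frames
        return self.classifier(pooled)

class LSTMClassifier(nn.Module):
    """Method 2 (sketch): treat the video as an ordered frame sequence and run an
    LSTM over the per-frame CNN features. The paper stacks five LSTM layers and
    combines per-frame predictions; this sketch reads out only the last step."""
    def __init__(self, frame_encoder: nn.Module, feat_dim: int,
                 hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = frame_encoder
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)                 # hidden states for every frame
        return self.classifier(out[:, -1])        # classify from the final state

# Toy usage with a stand-in encoder (hypothetical, not the paper's GoogLeNet/AlexNet):
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = LSTMClassifier(encoder, feat_dim=8, hidden_dim=16, num_classes=101)
logits = model(torch.randn(2, 30, 3, 64, 64))    # 2 clips of 30 RGB frames each
```

Because both variants consume features a frame at a time, they scale to full-length videos: pooling is order-invariant and cheap, while the LSTM preserves frame order at the cost of sequential computation.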
