Sparse Spatiotemporal Coding for Activity Recognition
Venue
Brown University (2010)
Publication Year
2010
Authors
Thomas Dean, Greg Corrado, Rich Washington
Abstract
We present a new approach to learning sparse, spatiotemporal features and
demonstrate the utility of the approach by applying the resulting sparse codes to
the problem of activity recognition. Learning features that discriminate among
human activities in video is difficult in part because the stable space-time events
that reliably characterize the relevant motions are rare. To overcome this problem,
we adopt a multi-stage approach to activity recognition. In the initial
preprocessing stage, we first whiten and apply local contrast normalization to each
frame of the video. We then apply an additional set of filters to identify and
extract salient space-time volumes that exhibit smooth periodic motion. We collect
a large corpus of these space-time volumes as training data for the unsupervised
learning of a sparse, over-complete basis using a variant of the two-phase
analysis-synthesis algorithm of Olshausen and Field [1997]. We treat the synthesis
phase, which consists of reconstructing the input from a sparse vector of mostly
zero coefficients, as the computationally critical step: to reduce the time
required for reconstruction in subsequent use and production, we adapted existing
algorithms to exploit potential parallelism through the use of readily-available
SIMD hardware. To obtain better codes, we developed a
new approach to learning sparse, spatiotemporal codes in which the number of basis
vectors, their orientations, velocities and the size of their receptive fields
change over the duration of unsupervised training. The algorithm starts with a
relatively small, initial basis with minimal temporal extent. This initial basis is
obtained through conventional sparse coding techniques and is expanded over time by
recursively constructing a new basis consisting of basis vectors with larger
temporal extent that proportionally conserve regions of previously trained weights.
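
To make this expansion step concrete, the following is a minimal sketch, assuming
a basis stored as an array of shape (n_vectors, px, py, t); the function name and
the exact placement of the conserved weights are illustrative assumptions, not the
paper's implementation:

    import numpy as np

    def expand_basis_temporally(basis, new_t, rng):
        # basis: shape (n_vectors, px, py, t); new_t > t is the enlarged
        # temporal extent. Placement of conserved weights is assumed.
        n, px, py, t = basis.shape
        expanded = 0.01 * rng.standard_normal((n, px, py, new_t))
        offset = (new_t - t) // 2
        # Conserve previously trained weights in the interior frames;
        # the newly added frames start as small random perturbations
        # and are adjusted by further unsupervised training.
        expanded[..., offset:offset + t] = basis
        # Renormalize each basis vector to unit L2 norm.
        flat = expanded.reshape(n, -1)
        flat /= np.linalg.norm(flat, axis=1, keepdims=True)
        return flat.reshape(n, px, py, new_t)
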
These proportionally conserved weights are combined with the result of adjusting
newly added weights to represent a greater range of primitive motion features. The
size of the current basis is determined probabilistically by sampling from existing
basis vectors according to their activation on the training set. The resulting
algorithm produces bases consisting of filters that are bandpass, spatially
oriented and temporally diverse in terms of their transformations and velocities.
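
The activation-based sampling that sizes the basis might look like the sketch
below, where activations holds each basis vector's mean absolute coefficient over
the training set; the names and the growth rule are assumptions:

    import numpy as np

    def sample_seed_vectors(basis, activations, n_new, rng):
        # Sample (with replacement) basis vectors in proportion to how
        # strongly they activate on the training set; frequently used
        # vectors seed more of the next, larger basis.
        probs = activations / activations.sum()
        idx = rng.choice(basis.shape[0], size=n_new, replace=True, p=probs)
        return basis[idx].copy()
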
We demonstrate the utility of our approach by using it to recognize human activity
in video.
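
As a rough illustration of the preprocessing stage described above, the sketch
below whitens each frame with a center-surround filter and applies divisive local
contrast normalization; the filter shape and parameters are assumptions, since the
paper's exact choices are not reproduced here:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def preprocess_frame(frame, sigma=2.0, eps=1e-6):
        # Center-surround "whitening": subtract a Gaussian-blurred copy
        # to suppress low spatial frequencies.
        f = frame.astype(np.float64)
        centered = f - gaussian_filter(f, sigma)
        # Divisive local contrast normalization: divide by the local
        # standard deviation estimated over the same neighborhood.
        local_sd = np.sqrt(gaussian_filter(centered ** 2, sigma))
        return centered / (local_sd + eps)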
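
For the synthesis phase, a generic iterative soft-thresholding (ISTA) loop such as
the following infers a mostly-zero coefficient vector; this is a stand-in under
assumed names, not the paper's SIMD-parallel implementation:

    import numpy as np

    def sparse_code(x, D, lam=0.1, lr=0.1, steps=100):
        # Infer coefficients a (most entries zero) so that D @ a
        # reconstructs the vectorized space-time volume x.
        a = np.zeros(D.shape[1])
        for _ in range(steps):
            grad = D.T @ (D @ a - x)   # gradient of 0.5 * ||x - D a||^2
            a -= lr * grad
            # Soft-threshold to drive small coefficients exactly to zero.
            a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)
        return a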
