Sparse Spatiotemporal Coding for Activity Recognition
We present a new approach to learning sparse, spatiotemporal features and demonstrate the utility of the approach by applying the resulting sparse codes to the problem of activity recognition. Learning features that discriminate among human activities in video is difficult in part because the stable space-time events that reliably characterize the relevant motions are rare. To overcome this problem, we adopt a multi-stage approach to activity recognition. In the initial preprocessing stage, we first whiten and apply local contrast normalization to each frame of the video. We then apply an additional set of filters to identify and extract salient space-time volumes that exhibit smooth periodic motion. We collect a large corpus of these space-time volumes as training data for the unsupervised learning of a sparse, over-complete basis using a variant of the two-phase analysis-synthesis algorithm of Olshausen and Field. The synthesis phase consists of reconstructing the input as a sparse (mostly zero coefficient) combination of basis vectors. To reduce computation time, and most importantly the time required for reconstruction in subsequent production use, we adapted existing algorithms to exploit potential parallelism through the use of readily available SIMD hardware. To obtain better codes, we developed a new approach to learning sparse, spatiotemporal codes in which the number of basis vectors, their orientations, velocities and the size of their receptive fields change over the duration of unsupervised training. The algorithm starts with a relatively small, initial basis with minimal temporal extent. This initial basis is obtained through conventional sparse coding techniques and is expanded over time by recursively constructing a new basis consisting of basis vectors with larger temporal extent that proportionally conserve regions of previously trained weights.
These proportionally conserved weights are combined with the result of adjusting newly added weights to represent a greater range of primitive motion features. The size of the current basis is determined probabilistically by sampling from existing basis vectors according to their activation on the training set. The resulting algorithm produces bases consisting of filters that are bandpass, spatially oriented and temporally diverse in terms of their transformations and velocities. We demonstrate the utility of our approach by using it to recognize human activity in video.
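The basis-expansion step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `expand_basis`, the list-of-frames data layout, the use of a per-vector activation score as the sampling weight, and the small Gaussian initialization of new weights are all assumptions made for illustration. In particular, previously trained weights are simply copied unchanged here, which is only one simple reading of "proportionally conserve".

```python
import random


def expand_basis(basis, activations, new_size, extra_frames, rng, init_scale=0.01):
    """Grow a spatiotemporal basis (hypothetical helper, for illustration only).

    basis:        list of basis vectors; each vector is a list of frames,
                  and each frame is a flat list of pixel weights.
    activations:  one non-negative score per basis vector, e.g. its mean
                  absolute coefficient over the training set (an assumption).
    new_size:     number of basis vectors in the expanded basis.
    extra_frames: number of frames of temporal extent added to each vector.
    """
    # Sample parent vectors with probability proportional to their activation.
    parents = rng.choices(range(len(basis)), weights=activations, k=new_size)
    expanded = []
    for p in parents:
        old = basis[p]
        n_pixels = len(old[0])
        # Conserve previously trained weights by copying them unchanged.
        frames = [list(frame) for frame in old]
        # Newly added weights start near zero; further training would adjust them.
        for _ in range(extra_frames):
            frames.append([rng.gauss(0.0, init_scale) for _ in range(n_pixels)])
        expanded.append(frames)
    return expanded, parents


# Toy usage: three single-frame basis vectors over four pixels.
rng = random.Random(0)
basis = [
    [[1.0, 0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0]],
]
activations = [0.7, 0.3, 0.0]  # the third vector is never active
expanded, parents = expand_basis(
    basis, activations, new_size=5, extra_frames=1, rng=rng
)
```

In this toy run the expanded basis has five vectors of two frames each, and the never-active third vector is not selected as a parent. A full training loop would alternate this expansion with further sparse-coding updates of the newly added weights.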