Discriminative Topics Modelling for Action Feature Selection and Recognition

Matteo Bregonzio, Jian Li, Shaogang Gong, Tao Xiang
In Proceedings of the British Machine Vision Conference (BMVC), 2010
Problem - This paper addresses the problem of recognising realistic human actions captured in unconstrained environments (Fig. 1). Existing approaches to action recognition have focused on improving visual feature representation using either spatio-temporal interest points or key-point trajectories. However, these methods are insufficient when action videos are recorded in unconstrained environments because: (1) reliable visual features are hard to extract due to occlusions, illumination changes, scale variation and background clutter; (2) the effectiveness of visual features depends strongly on the unpredictable characteristics of camera movement; (3) complicated visual actions result in unequal discriminativeness of visual features.

Our Solutions - In this paper, we present a novel framework for recognising realistic human actions in unconstrained environments. The novelties of our work lie in three aspects. First, we propose a new action representation based on computing a rich set of descriptors from key-point trajectories. Second, to cope with drastic changes in motion characteristics with and without camera movement, we develop an adaptive feature fusion method that combines different local motion descriptors, improving model robustness against feature noise and background clutter. Finally, we propose a novel Multi-Class Delta Latent Dirichlet Allocation (MC-∆LDA) model for feature selection, in which the most informative features in a high-dimensional feature space are selected collaboratively rather than independently.

Motion Descriptors - We first compute trajectories of key points using a KLT tracker and SIFT matching. After pruning trajectories by identifying the Region of Interest (ROI), we compute three types of motion descriptor from the surviving trajectories. First, the Orientation-Magnitude Descriptor is extracted by quantising the orientation and magnitude of motion between consecutive points on the same trajectory. Second, the Trajectory Shape Descriptor is extracted by computing the Fourier coefficients of a single trajectory. Finally, the Appearance Descriptor is extracted by computing SIFT features at all points of a trajectory.

Interest Point Features - We also detect spatio-temporal interest points, as they contain information complementary to the trajectory features. At each interest point, a surrounding 3D cuboid is extracted.
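The two geometric trajectory descriptors can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation: the bin counts (8 orientation bins, 5 magnitude bins), the magnitude cap, and the number of Fourier coefficients are illustrative assumptions.

```python
import numpy as np

def orientation_magnitude_descriptor(traj, n_ori=8, n_mag=5, max_mag=20.0):
    """Joint histogram of quantised orientation and magnitude of motion
    between consecutive trajectory points (bin counts are assumptions)."""
    traj = np.asarray(traj, dtype=float)      # shape (T, 2): x, y per frame
    d = np.diff(traj, axis=0)                 # displacement between frames
    ori = np.arctan2(d[:, 1], d[:, 0])        # orientation in [-pi, pi]
    mag = np.linalg.norm(d, axis=1)           # motion magnitude
    ori_bin = np.minimum(((ori + np.pi) / (2 * np.pi) * n_ori).astype(int),
                         n_ori - 1)
    mag_bin = np.minimum((mag / max_mag * n_mag).astype(int), n_mag - 1)
    hist = np.zeros((n_ori, n_mag))
    for o, m in zip(ori_bin, mag_bin):
        hist[o, m] += 1
    return hist.ravel() / max(len(d), 1)      # normalised joint histogram

def trajectory_shape_descriptor(traj, n_coeff=10):
    """Low-frequency Fourier coefficients of a trajectory viewed as the
    complex signal x(t) + i*y(t); taking magnitudes of the centred signal's
    spectrum gives a translation-invariant shape code."""
    traj = np.asarray(traj, dtype=float)
    z = traj[:, 0] + 1j * traj[:, 1]
    z = z - z.mean()                          # remove translation
    return np.abs(np.fft.fft(z)[:n_coeff])
```

Both functions map a variable-length point track to a fixed-length vector, which is what allows trajectories of different durations to be compared and pooled.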
We describe these cuboids with gradient vectors and use PCA to reduce the descriptor's dimensionality.

Adaptive Feature Fusion - We wish to adaptively fuse trajectory-based descriptors with 3D interest-point-based descriptors according to the presence of camera movement. A moving camera is detected by computing the global optical flow over all frames in a clip: if the majority of frames contain global motion, we regard the clip as recorded by a moving camera. For clips without camera movement, both interest point and trajectory based descriptors can be computed reliably, so both types are used for recognition. In contrast, when camera motion is detected, interest-point-based descriptors are less meaningful, so only trajectory descriptors are employed.

Collaborative Feature Selection - We propose an MC-∆LDA model (Fig. 2) for collaboratively selecting dominant features for classification. We consider each video clip x_j as a mixture of N_t topics Φ = {φ_t}, t = 1, ..., N_t (to be discovered), each of which, φ_t, is a multinomial distribution over N_w words (visual features). The MC-∆LDA model constrains topic proportions non-uniformly and on a per-clip basis. For each video clip belonging to action category A_c, we model it as a mixture of: (1) N_t^s topics which are shared by all N_c action categories, and (2) N_{t,c} topics which are uniquely associated with action category A_c. In MC-∆LDA, the non-uniform topic mixture proportion for a single clip x_j is enforced by its action class label c_j and the hyperparameter α_c for the corresponding action class c. The total number of topics is N_t = N_t^s + ∑_{c=1}^{N_c} N_{t,c}.
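The per-class topic allocation N_t = N_t^s + ∑_c N_{t,c} can be illustrated by constructing the class-conditional Dirichlet hyperparameter α_c. This is a sketch of the ∆LDA-style mechanism under stated assumptions: the function name, the specific topic counts, and the pseudo-count values (1.0 for active topics, a near-zero value for suppressed ones) are illustrative, not the paper's settings.

```python
import numpy as np

def build_alpha(n_shared, topics_per_class, c, alpha_on=1.0, alpha_off=1e-6):
    """Per-class Dirichlet hyperparameter for an MC-∆LDA-style model
    (illustrative values): a clip of class c may draw from the N_t^s shared
    topics and its own N_{t,c} class-specific topics, while other classes'
    topics are effectively switched off by a near-zero pseudo-count."""
    n_total = n_shared + sum(topics_per_class)   # N_t = N_t^s + sum_c N_{t,c}
    alpha = np.full(n_total, alpha_off)          # suppress everything by default
    alpha[:n_shared] = alpha_on                  # shared topics: open to all classes
    start = n_shared + sum(topics_per_class[:c]) # this class's topic block
    alpha[start:start + topics_per_class[c]] = alpha_on
    return alpha

# e.g. 4 shared topics and 3 classes with 2 class-specific topics each -> N_t = 10
alpha_1 = build_alpha(n_shared=4, topics_per_class=[2, 2, 2], c=1)
```

Because clips of class c can place mass only on the shared block and their own block, class-specific topics (and the visual words they favour) are discovered jointly across all clips of that class, which is what makes the feature selection collaborative rather than independent.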
doi:10.5244/c.24.8 dblp:conf/bmvc/BregonzioLGX10