A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection

Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, Ming Shao
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, two-stream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset and a new MERL Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.
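The difference between stacked optical flow and pixel trajectories can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names, the nested-list flow format, and the nearest-neighbour lookup are all assumptions made for illustration. Stacked flow samples every flow field at a fixed image location, whereas the trajectory version follows the scene point and stores its cumulative displacement against its starting position.

```python
def stacked_flow(flows, y, x):
    """Stacked optical flow: read each flow field at the same fixed pixel."""
    return [flow[y][x] for flow in flows]


def pixel_trajectory(flows, y, x):
    """Follow the point that starts at (y, x) through successive flow fields.

    Returns its cumulative (dx, dy) displacement after each frame, so every
    value belongs to the same scene point and is indexed by its start position.
    Assumption: flows[t][row][col] is the (dx, dy) motion from frame t to t+1.
    """
    height, width = len(flows[0]), len(flows[0][0])
    cur_y, cur_x = float(y), float(x)
    displacements = []
    for flow in flows:
        # Nearest-neighbour lookup, clamped to the image bounds (a simplification).
        row = min(max(int(round(cur_y)), 0), height - 1)
        col = min(max(int(round(cur_x)), 0), width - 1)
        dx, dy = flow[row][col]
        cur_x += dx
        cur_y += dy
        displacements.append((cur_x - x, cur_y - y))
    return displacements


def make_flow(width, moving_x):
    """Hypothetical 1-row flow field: only the pixel at moving_x moves right by 1."""
    return [[(1, 0) if col == moving_x else (0, 0) for col in range(width)]]


# A single point starts at x=1 and moves one pixel to the right per frame.
flows = [make_flow(5, 1), make_flow(5, 2), make_flow(5, 3)]
fixed_location = stacked_flow(flows, 0, 1)
followed_point = pixel_trajectory(flows, 0, 1)
```

In this toy sequence the fixed-location stack sees motion only in the first frame, because the point then leaves that pixel, while the trajectory keeps recording the same point's displacement across all three frames.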
doi:10.1109/cvpr.2016.216 dblp:conf/cvpr/SinghMJTS16 fatcat:jz2aip3dg5gszcj7z3cpcj45qy