Informedia@TRECVID 2011: Surveillance Event Detection

Lei Bao, Longfei Zhang, Shoou-I Yu, Zhen-zhong Lan, Lu Jiang, Arnold Overwijk, Qin Jin, Shohei Takahashi, Brian Langner, Yuanpeng Li, Michael Garbus, Susanne Burger (+2 others)
2011 TREC Video Retrieval Evaluation  
The Informedia group participated in three tasks this year: Multimedia Event Detection (MED), Semantic Indexing (SIN), and Surveillance Event Detection (SED). The first half of the report describes our efforts on MED and SIN, while the second half discusses our approach to SED. Both MED and SIN consist of three main steps: feature extraction, detector training, and fusion. In the feature extraction step, we extracted many low-level features, high-level features, and text features. Specifically, we used the Spatial Pyramid Matching technique to represent low-level visual local features, such as SIFT and MoSIFT, which captures the location information of feature points. In the detector training step, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale unbalanced data classification problem. In the fusion step, to take advantage of different features, we tried three fusion methods: early fusion, late fusion, and double fusion, where double fusion is a combination of early fusion and late fusion. The experimental results demonstrated that double fusion is consistently better than, or at worst comparable to, early fusion and late fusion.
The second half of this paper presents a generic event detection system evaluated in the SED task of TRECVID 2011. We investigated a generic statistical approach with spatio-temporal features applied to the seven events defined by the SED task. The approach is based on local spatio-temporal descriptors, called MoSIFT, generated from pairwise video frames. Visual vocabularies are generated from cluster centers of MoSIFT features sampled from the video clips. We also estimated the spatial distribution of actions by over-generated person detection and background subtraction. Different sliding window sizes and step sizes were adopted for different events based on priors over event duration. Several sets of one-against-all action classifiers were trained using cascaded non-linear SVMs and Random Forests, which improved classification performance on unbalanced data such as the SED datasets. Results of nine runs are presented, varying i) sliding window size, ii) bag-of-words step size, iii) classifier threshold, and iv) classifier type.
The performance shows improvement over last year on the event detection task.
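The relationship between the three fusion strategies can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature names (`sift_bow`, `mosift_bow`) and the toy nearest-centroid scorer (standing in for the SVM and Random Forest models) are assumptions made for the example.

```python
import numpy as np

def centroid_scores(X, y):
    """Toy scorer: margin between distances to the two class centroids.
    Higher score means the sample lies closer to class 1's centroid."""
    c0 = X[y == 0].mean(axis=0)
    c1 = X[y == 1].mean(axis=0)
    return np.linalg.norm(X - c0, axis=1) - np.linalg.norm(X - c1, axis=1)

rng = np.random.default_rng(0)
n = 100
sift_bow = rng.normal(size=(n, 20))    # stand-in for a SIFT bag-of-words
mosift_bow = rng.normal(size=(n, 20))  # stand-in for a MoSIFT bag-of-words
y = rng.integers(0, 2, size=n)         # binary event labels

# Early fusion: concatenate the feature vectors, then score once.
early = centroid_scores(np.hstack([sift_bow, mosift_bow]), y)

# Late fusion: score each feature separately, then average the scores.
late = (centroid_scores(sift_bow, y) + centroid_scores(mosift_bow, y)) / 2

# Double fusion: treat the early-fusion system as one more system and
# average it together with the per-feature systems.
double = (early + centroid_scores(sift_bow, y)
          + centroid_scores(mosift_bow, y)) / 3
```

The design point is that double fusion never discards information available to either strategy: the concatenated-feature system enters the late-fusion average as just another score source.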
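The per-event sliding-window scheme can be illustrated with a small helper. The window and step values below are made-up numbers for the example, not the paper's per-event settings, which were chosen from event duration priors.

```python
def sliding_windows(n_frames, window, step):
    """Yield (start, end) frame ranges tiling a clip with a fixed
    window size and step; a bag-of-words histogram of MoSIFT features
    would be built and classified per window."""
    start = 0
    while start + window <= n_frames:
        yield (start, start + window)
        start += step

# Example: a 100-frame clip, 30-frame windows, 10-frame step.
wins = list(sliding_windows(n_frames=100, window=30, step=10))
print(wins[:3])  # [(0, 30), (10, 40), (20, 50)]
```

A shorter, more frequent event would use a smaller window and step, while a longer event would use a larger window, which is why the runs vary these two parameters independently.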