SRI-Sarnoff AURORA System at TRECVID 2012 Multimedia Event Detection and Recounting

Hui Cheng, Jingen Liu, Saad Ali, Omar Javed, Qian Yu, Amir Tamrakar, Ajay Divakaran, Harpreet S. Sawhney, R. Manmatha, James Allan, Alexander G. Hauptmann, Mubarak Shah, et al.
2012 TREC Video Retrieval Evaluation  
In this paper, we describe the evaluation results for the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks as part of the SRI-Sarnoff AURORA system, developed under the IARPA ALADDIN program. In the AURORA system, we incorporated various low-level features that capture color, appearance, motion, and audio information in videos. Based on these low-level features, we developed Fixed-Pattern and Object-Oriented spatial feature pooling, which results in a significant performance improvement to our system.
In addition, we collected more than 1,800 concepts and designed a set of concept pooling approaches to build the Concept-Based Event Representation (CBER, i.e., high-level features). We submitted six runs for the MED task, exploring various fusions of low-level features, high-level features, and ASR/OCR features. All runs achieve satisfactory results; in particular, the two EK10Ex runs, for both pre-specified events (PS-Events) and ad-hoc events (AH-Events), obtain relatively better results. For the MER task, we developed an approach that provides a breakdown of the evidence behind an MED decision by exploring the SVM-based event detector. Furthermore, we designed evidence-specific verification and detection to reduce uncertainty and improve key evidence discovery. Our MER evaluation results for MER-to-Event are very good.

Multimedia Event Detection

The AURORA system incorporates two types of features: low-level features and high-level features. Low-level features are designed to acquire the first-hand characteristics of an event, such as the appearance, color, and motion of the objects involved and the scene structure. These low-level features are quantized into visual words, which are used to model an event as a Bag of Visual Words (BOW). We treat this BOW as average feature pooling over the whole frame. However, a specific event typically has its own regions of interest that produce the most informative evidence for that event. Hence, we propose a new strategy for spatial pooling of the low-level features, which results in an event model that captures spatial information. However, training an event model on low-level features usually requires a large number of training examples for good generalization, because of the diversity of visual/audio content. To achieve better generalization with fewer training examples, we developed over 1,800 visual concepts, from which we derived various concept features. The concept features also enable better MER. We describe the details in the following sections.
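To make the pooling step concrete, the sketch below (our own illustration, not the AURORA implementation) quantizes local descriptors against a pre-trained k-means codebook and pools the resulting visual words in two ways: averaged over the whole frame, as in the standard BOW, and over a fixed 2x2 grid of spatial cells as a simple stand-in for fixed-pattern spatial pooling. The function names, the grid layout, and the scikit-learn codebook are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(descriptors, codebook):
    # Whole-frame average pooling: one normalized visual-word histogram.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

def grid_pooled_histogram(descriptors, positions, frame_shape, codebook, grid=(2, 2)):
    # Fixed-grid spatial pooling: one BOW histogram per cell, concatenated.
    h, w = frame_shape
    rows, cols = grid
    words = codebook.predict(descriptors)
    hists = np.zeros((rows * cols, codebook.n_clusters), dtype=np.float32)
    for word, (x, y) in zip(words, positions):
        r = min(int(y * rows / h), rows - 1)
        c = min(int(x * cols / w), cols - 1)
        hists[r * cols + c, word] += 1.0
    hists /= np.maximum(hists.sum(axis=1, keepdims=True), 1.0)
    return hists.ravel()

# Illustrative usage with random data standing in for SIFT-like descriptors.
rng = np.random.default_rng(0)
descs = rng.normal(size=(500, 128)).astype(np.float32)    # 500 local descriptors
pos = rng.uniform(0, 1, size=(500, 2)) * [640, 480]       # their (x, y) image locations
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(descs)
frame_bow = bow_histogram(descs, codebook)                            # length 64
frame_grid = grid_pooled_histogram(descs, pos, (480, 640), codebook)  # length 4 * 64

Compared with whole-frame averaging, grid pooling preserves coarse layout information at the cost of a longer feature vector.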
Low-Level Visual Features

We developed a variety of low-level features to capture various aspects of an event, such as scene, object, and action. These features are extracted either from sampled frames (static features) or from spatiotemporal windows of frames (i.e., XYT-volumes; dynamic features) of a video.

Static Features

Static features are computed from sampled frames (one sample every second). They are assumed to provide object or scene appearance information for an event. The following static features are extracted:

A. GIST [1]: This feature was proposed to capture the structure/shape of real-world scenes. It is essentially a holistic statistical signature of an image, representing the scene with a Spatial Envelope consisting of a set of perceptual dimensions (e.g., naturalness, openness, roughness, expansion, and ruggedness). It is a fast way to coarsely capture the scene structure of an event. We quantize the GIST feature of each frame and represent the video as a bag of quantized GIST features.

B. SIFT [2]: SIFT is a widely used feature descriptor for image matching and classification. The 128-dimensional SIFT descriptor is rotation invariant and captures the local texture structure of an image. We extracted two types of SIFT features: sparse SIFT (S-SIFT) and dense SIFT (D-SIFT). S-SIFT is computed around interest points detected by a corner detector, while D-SIFT is computed on densely sampled image patches. The former is used to describe informative patches of an object, while the latter is good at capturing the local patch distribution over a scene.

C. colorSIFT [3]: This feature is an extension of SIFT. Instead of computing SIFT on intensity gradients only, colorSIFT detects interest points and creates descriptors on color gradients. It contains three 128-dimensional vectors, the first computed from the intensity gradient and the other two from the color gradients. As a result, it is able to capture both intensity and color information.

D. Transformed Color Histogram [4]: This is a normalized color histogram as described in [4].

Dynamic Motion Features

Dynamic features are computed from detected XYT-volumes of a video. These XYT-volumes are sampled by detecting spatio-temporal interest points or by tracking 2D corner point trajectories. They are supposed to capture the motion information of an event.
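As one concrete way to obtain the corner-point trajectories that such XYT-volumes can be built around, the rough sketch below tracks Shi-Tomasi corners across frames with pyramidal Lucas-Kanade optical flow using OpenCV. The function name, parameter values, and choice of tracker are our own assumptions for illustration, not the detectors used in the AURORA system.

import cv2
import numpy as np

def corner_trajectories(video_path, track_len=15, max_corners=200):
    # Track Shi-Tomasi corners with pyramidal Lucas-Kanade optical flow.
    # Returns a list of (x, y) point lists; each one is a corner trajectory
    # that an XYT volume could be placed around.
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return []
    tracks = [[tuple(p.ravel())] for p in pts]

    for _ in range(track_len - 1):
        ok, frame = cap.read()
        if not ok or len(pts) == 0:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        kept_tracks, kept_pts = [], []
        for tr, p, st in zip(tracks, nxt, status.ravel()):
            if st:  # keep only points tracked successfully into this frame
                tr.append(tuple(p.ravel()))
                kept_tracks.append(tr)
                kept_pts.append(p)
        tracks, pts = kept_tracks, np.float32(kept_pts).reshape(-1, 1, 2)
        prev_gray = gray

    cap.release()
    return tracks

Descriptors computed inside the volumes defined by these tracks (or by spatio-temporal interest points) can then be quantized into visual words in the same way as the static features.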