VIREO/ECNU @ TRECVID 2013: A Video Dance of Detection, Recounting and Search with Motion Relativity and Concept Learning from Wild
TREC Video Retrieval Evaluation
The VIREO group participated in four tasks: instance search, multimedia event recounting, multimedia event detection, and semantic indexing. In this paper, we present our approaches and discuss the results submitted to TRECVID 2013.
Instance Search (INS): We submitted four runs in total, experimenting with three search paradigms for retrieving particular objects: (1) an elastic spatial consistency checking method; (2) a background context weighting strategy; and (3) a re-ranking step based
... n objects mining. The first two approaches are similar to those we used last year, while the last one is a new exploration. All our submissions are based on the BoW model and tailored to the INS task. In particular, we use Delaunay Triangulation (DT) to handle the complex spatial transformations of non-planar and non-rigid queries; we tackle the lack of information for small query objects with context modeling; and object mining augments the results by exploring frequent instances in TV series.
- F_X_NO_vireo_dt_2: BoW method + elastic spatial checking via DT. This run corresponds to paradigm (1), which models elastic spatial structures as deformable graphs.
- F_X_NO_vireo_dtc_1: vireo_dt + context modeling. This run corresponds to paradigm (2), which weights the importance of different features in the query.
- F_X_NO_vireo_dtm_4: The mining result is fused with the results of vireo_dt via random walk (paradigms (1) + (3)). Links established by our mining algorithm serve as the cues for re-ranking.
- F_X_NO_vireo_dtcm_3: The mining result is fused with vireo_dtc through random walk (paradigms (2) + (3)). This run uses the ranking list from vireo_dtc.
Multimedia Event Detection (MED): In this year's MED task, we submitted two runs to evaluate our visual system and our full system, respectively.
- FullSys_PROGAll_PS_100Ex_1: Detectors trained by combining visual and audio features.
- VisualSys_PROGAll_PS_100Ex_1: Visual features, including SIFT, ColorSIFT, motion relativity, and STIP, are used for event detection.
Multimedia Event Recounting (MER): We submitted recountings for the positive videos based on evidence from audio-visual concepts. The visual evidence is built upon a graphical network, and the recounting is generated by exploiting the network's ontology. In particular, we implemented object/scene, action, and non-speech audio detectors for evidence collection. In addition, important keywords are mined from the ASR and OCR output as supplementary evidence.
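As an illustration of the random-walk fusion used by the two INS runs that combine the mining result with an existing ranked list, the following is a minimal sketch and not the submitted implementation. It assumes a personalized-PageRank-style walk in which the initial retrieval scores act as the restart distribution and the mined links form an undirected, unweighted graph. The function name `random_walk_rerank`, the damping factor `alpha`, and the iteration count are hypothetical choices made for this example only.

```python
def random_walk_rerank(initial_scores, links, alpha=0.85, iterations=50):
    """Re-rank items by propagating initial scores over a link graph.

    initial_scores: dict mapping item id -> non-negative retrieval score
                    (assumed to come from the baseline ranked list).
    links:          iterable of (item, item) pairs (assumed to be mined links).
    alpha:          probability of following a link rather than restarting
                    (an assumed value, not taken from the paper).
    Returns (items sorted by final score, dict of final scores).
    """
    items = list(initial_scores)
    total = sum(initial_scores.values())
    if total <= 0:
        raise ValueError("initial scores must have a positive sum")
    # Restart distribution: the normalized initial scores.
    prior = {i: initial_scores[i] / total for i in items}

    # Undirected adjacency built from the links; unknown items are ignored.
    neighbours = {i: set() for i in items}
    for a, b in links:
        if a in neighbours and b in neighbours and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    rank = dict(prior)
    for _ in range(iterations):
        # Mass held by items without links is redistributed via the prior.
        dangling = sum(rank[i] for i in items if not neighbours[i])
        new_rank = {}
        for i in items:
            # Each neighbour splits its mass equally among its own links.
            incoming = sum(rank[j] / len(neighbours[j]) for j in neighbours[i])
            new_rank[i] = (alpha * (incoming + dangling * prior[i])
                           + (1 - alpha) * prior[i])
        rank = new_rank

    order = sorted(items, key=lambda i: rank[i], reverse=True)
    return order, rank
```

Under these assumptions, an item with a modest initial score that is linked to a highly ranked item can move above an unlinked item with a slightly higher initial score, which is the intended effect of using links as re-ranking cues.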
Semantic Indexing (SIN): This year, we focused on a new feature representation extracted using deep neural networks (DNN). In the semantic indexing system, we used the DNN feature together with local and global features to train SVM models for each concept. We then evaluated the contributions of the different features using several fusion strategies for SIN. In addition, we submitted two runs for the "no annotation" condition, using Web images crawled from Flickr as training examples. These two runs are based on the model developed in  . In total, we submitted five runs, summarized below:
- 13_M_A_vireo.Baseline+DNN_1: Fuses the detection scores of classifiers using two global features, three local features, and the new DNN feature.
- 13_M_A_vireo.DNN_2: Concept detectors are learned using the DNN feature.
- 13_M_A_vireo.Baseline_3: The same as the baseline of our TRECVID 2012 system, where global and local features are used.
- 13_M_F_vireo.SP_4: Concept detectors are learned on a training set sampled from Web images using the Semantic Pooling (SP) method. Both local and global visual features are used.
- 13_M_F_vireo.SP_KW_5: The training set is the same as in run 13_M_F_vireo.SP_4, but only local features are employed.
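The abstract does not say which fusion strategies were evaluated. As one common possibility, the sketch below shows weighted late fusion of per-feature detection scores for a single concept. The min-max normalization, the weighting scheme, and the names `min_max_normalize` and `late_fusion` are assumptions made for illustration; they are not taken from the submitted system.

```python
def min_max_normalize(scores):
    """Rescale a dict of shot -> score into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are identical, so they carry no ranking information.
        return {shot: 0.0 for shot in scores}
    return {shot: (value - lo) / (hi - lo) for shot, value in scores.items()}


def late_fusion(per_feature_scores, weights=None):
    """Fuse detection scores from several feature-specific classifiers.

    per_feature_scores: dict mapping feature name -> {shot id -> raw score}
                        (for example, one entry per SVM model of a concept).
    weights:            optional dict mapping feature name -> weight;
                        defaults to equal weights (average fusion).
    Returns a dict of shot id -> fused score, restricted to the shots
    that every feature has scored.
    """
    if not per_feature_scores:
        raise ValueError("at least one feature is required")
    names = list(per_feature_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")

    normalized = {name: min_max_normalize(per_feature_scores[name])
                  for name in names}
    shots = set.intersection(*(set(per_feature_scores[name]) for name in names))
    return {
        shot: sum(weights[name] * normalized[name][shot] for name in names)
        / total_weight
        for shot in shots
    }
```

Calling `late_fusion` with no `weights` argument gives plain average fusion; passing per-feature weights makes it possible to compare, for instance, how strongly the DNN feature should count relative to the local and global features.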