BBN VISER TRECVID 2014 Multimedia Event Detection and Multimedia Event Recounting Systems

Florian Luisier, Manasvi Tickoo, Walter Andrews, Guangnan Ye, Dong Liu, Shih-Fu Chang, Ruslan Salakhutdinov, Vlad I. Morariu, Larry Davis, Abhinav Gupta, Ismail Haritaoglu, Sadiye Guler (+1 others)
2014 TREC Video Retrieval Evaluation  
In this paper, we describe the Raytheon BBN Technologies (BBN) led VISER system for the TRECVID 2014 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. We present a comprehensive analysis of the system's modules:

(1) Metadata Generator (MG): a large suite of audio-visual low-level and semantic features; a set of deep convolutional neural network (DCNN) features trained on the ImageNet dataset; automatic speech recognition (ASR); and videotext detection and recognition (OCR). For the low-level features, we used D-SIFT, Opponent SIFT, dense trajectories (HOG+HOF+MBH), and MFCC, with a Fisher Vector (FV) representation. For the semantic concepts, we trained 1,800 weakly supervised concepts from the Research Set videos and a set of YouTube videos. These concepts include objects, actions, and scenes, as well as noun-verb bigrams. We also use the output layer of the DCNN as a 1,000-dimensional semantic feature. For the speech and videotext content, we leveraged rich confidence-weighted keywords and phrases obtained from the BBN ASR and OCR systems.

(2) Event Query Generation (EQG): linear SVM event models are trained for each feature and combined using a probabilistic late-fusion framework. Our system combines SVM-based and query-based detection to achieve superior performance despite the varying number of positive videos across training conditions. We present a thorough study and evaluation of the different features used in our system.

(3) Event Search (ES): at search time, simple dot products with the SVM hyperplane are computed for each feature and then rescaled into a posterior probability score for each video.

(4) Semantic Query Generation (SQG): we use the Indri document retrieval system to search for the words/terms closest to the given event name in a static offline corpus of around 100,000 Wikipedia and Gigapedia documents. After basic text processing, these words (ranked by TF-IDF) are projected into our concept vocabulary using a text corpus knowledge source (e.g., Gigaword).

Further, to meet the tight timing constraints of this year's evaluation across all of the above modules, we compressed all our features to 1 byte per dimension, which significantly sped up feature extraction, model training, and event search. Owing to the optimized feature design, model training, and strong compression scheme, we drastically reduced the time spent on disk I/O during MG, EQG, and ES.

Consistent with previous MED evaluations, low-level features still exhibit strong performance. However, their MED performance can now be matched by purely semantic features, even in the 100Ex and 010Ex training conditions. As a result, our modules are among the fastest, while maintaining very strong performance. Our mean average precision (MAP) and Minimum Acceptable Recall (MR0) results are consistently among the top 3 performers for all training conditions, for both prespecified and ad-hoc events. For the MER task, 65% of the evaluators found the key evidence convincing, and 75% of the judges agreed with the query's conciseness and logical appeal.
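To make the low-level pipeline in module (1) concrete, below is a minimal sketch of Fisher Vector encoding over a set of local descriptors (e.g., D-SIFT or dense-trajectory descriptors). For brevity it keeps only the gradients with respect to the GMM means; the GMM size, descriptor dimensions, and normalization details are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """FV (mean-gradient part) of an N x D descriptor set under a diagonal GMM."""
    q = gmm.predict_proba(descriptors)               # N x K soft assignments
    n = descriptors.shape[0]
    # Whitened differences to each Gaussian mean: N x K x D
    diff = (descriptors[:, None, :] - gmm.means_[None, :, :]) / np.sqrt(gmm.covariances_)[None, :, :]
    fv = (q[:, :, None] * diff).sum(axis=0)          # K x D gradients w.r.t. the means
    fv /= n * np.sqrt(gmm.weights_)[:, None]
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv.ravel() / (np.linalg.norm(fv) + 1e-12) # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 64))                  # pooled training descriptors
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(train)
video = rng.normal(size=(500, 64))                   # descriptors from one test video
print("FV dimension:", fisher_vector(video, gmm).shape[0])  # 8 * 64 = 512
```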
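Modules (2) and (3) describe per-feature linear SVM scoring followed by probabilistic late fusion. A minimal sketch of that scoring-and-fusion step is given below, assuming a Platt-style sigmoid for the margin-to-posterior rescaling and equal fusion weights; the actual calibration and weighting used by VISER are not specified in the abstract, and all feature names here are placeholders.

```python
import numpy as np

def svm_score(x, w, b):
    """Raw SVM margin: dot product with the hyperplane normal plus bias."""
    return np.dot(x, w) + b

def to_posterior(margin, a=-1.0, c=0.0):
    """Platt-style sigmoid rescaling of a margin into a [0, 1] score."""
    return 1.0 / (1.0 + np.exp(a * margin + c))

def late_fuse(posteriors, weights=None):
    """Weighted arithmetic mean of per-feature posterior scores."""
    posteriors = np.asarray(posteriors)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    return float(np.dot(weights, posteriors))

# Example: three per-feature models scoring one test video.
rng = np.random.default_rng(0)
features = {name: rng.normal(size=256) for name in ("dsift_fv", "traj_fv", "mfcc_fv")}
models = {name: (rng.normal(size=256), 0.1) for name in features}

posteriors = [to_posterior(svm_score(features[n], *models[n])) for n in features]
print("fused event score:", late_fuse(posteriors))
```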
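For module (4), the sketch below illustrates the general idea of semantic query generation under stand-in components: the Indri retrieval step is replaced by a toy document list and the Gigaword-based projection by random embeddings, so only the TF-IDF term ranking and nearest-concept projection steps are shown.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # stand-in for documents retrieved for an event name such as "dog show"
    "a dog show where handlers parade dogs before judges in a ring",
    "the judge inspects each dog and the handler runs it around the ring",
]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
terms = np.asarray(vec.get_feature_names_out())
ranked = terms[np.argsort(-np.asarray(tfidf.sum(axis=0)).ravel())]  # TF-IDF-ranked terms

concepts = ["dog", "running", "crowd", "indoor_arena"]  # toy concept vocabulary
rng = np.random.default_rng(0)
embed = {w: rng.normal(size=50) for w in np.concatenate([ranked, concepts])}

def nearest_concept(term):
    """Cosine-similarity projection of a query term into the concept vocabulary."""
    t = embed[term] / np.linalg.norm(embed[term])
    return max(concepts, key=lambda c: float(t @ (embed[c] / np.linalg.norm(embed[c]))))

for term in ranked[:5]:
    print(term, "->", nearest_concept(term))
```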
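Finally, the 1-byte-per-dimension feature compression can be sketched as a simple per-dimension linear quantization to uint8, as below. The abstract does not specify the exact quantizer, so the min/max calibration here is an assumption; the point is the 4x reduction over float32 storage, which cuts disk I/O accordingly.

```python
import numpy as np

def fit_quantizer(X):
    """Per-dimension min/max over a calibration set (rows = videos)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    scale = np.where(hi > lo, 255.0 / (hi - lo), 1.0)
    return lo, scale

def quantize(X, lo, scale):
    """Map float features to uint8: 1 byte per dimension on disk."""
    return np.clip((X - lo) * scale, 0, 255).round().astype(np.uint8)

def dequantize(Q, lo, scale):
    """Approximate reconstruction for SVM training and event search."""
    return Q.astype(np.float32) / scale + lo

X = np.random.default_rng(0).normal(size=(100, 4096)).astype(np.float32)
lo, scale = fit_quantizer(X)
Q = quantize(X, lo, scale)
print("max reconstruction error:", np.abs(dequantize(Q, lo, scale) - X).max())
```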