VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video Hyperlinking

Phuong Anh Nguyen, Qing Li, Zhi-Qi Cheng, Yi-Jie Lu, Hao Zhang, Xiao Wu, Chong-Wah Ngo
2017 TREC Video Retrieval Evaluation  
In this paper, we describe the systems developed for the Video-to-Text (VTT), Ad-hoc Video Search (AVS), and Video Hyperlinking (LNK) tasks at TRECVID 2017 [1] and the results achieved.

Video-to-Text Description (VTT): We participate in the TRECVID 2017 pilot task of Video-to-Text Description, which consists of two subtasks, i.e., Matching and Ranking, and Description Generation.

Matching and Ranking task: To compare the effectiveness of spatial and temporal attention, we experiment with the following models (a minimal sketch of the no-attention matching pipeline is given after the VTT description):
- No attention model: Each video is represented by average pooling the ResNet-152 features extracted from frames over both the spatial and temporal dimensions, and the text description is encoded by an LSTM. We then learn an embedding space that minimizes the distance between a video and its corresponding text description under a triplet loss. Furthermore, C3D is used to extract motion features of the videos; the similarity scores from the two kinds of features are averaged for the final ranking.
- Spatial attention model: Average pooling is applied only over the temporal dimension and the spatial dimension is kept. We then train an attention model on the spatial feature map of the video to compute the similarity score used for the final ranking.
- Temporal attention model: In contrast to the spatial attention model, we train the attention model at the frame level and perform average pooling over the spatial dimension.
- No-spatial-temporal attention model: The similarity scores from the three models above are averaged for the final ranking.

Description Generation task: We adopt an approach similar to the matching and ranking task; the difference is that an LSTM is used to generate the sentence word by word. More details about the model can be found in [2, 3]. Our submissions can be summarized as follows:
- No attention model: Each video is represented by average pooling the ResNet-152 frame features over both the spatial and temporal dimensions, and an LSTM generates the sentence word by word. Furthermore, we concatenate the ResNet-152 and C3D features and feed them into the LSTM to generate descriptions for the videos.
- Spatial attention model: For the video features, average pooling is applied only over the temporal dimension and the spatial dimension is kept. We then train an attention model that attends to different spatial features when generating different words of the sentence.
- Temporal attention model: For the video features, this model performs average pooling over the spatial dimension and learns an attention model over the temporal dimension.
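To make the no-attention matching pipeline concrete, the sketch below shows one way to implement it in PyTorch. It is illustrative only and not the exact model of [2, 3]: the embedding size, margin, and the use of in-batch negatives for the triplet loss are assumptions, and the same module would be instantiated once for ResNet-152 features and once for C3D features, with their similarity scores averaged at ranking time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: 2048-d pooled ResNet-152 (or 4096-d C3D) video features,
# 300-d word embeddings, 512-d joint space.
EMBED_DIM = 512

class VideoTextEmbedding(nn.Module):
    """Joint embedding of a pooled video feature and an LSTM-encoded caption."""

    def __init__(self, video_dim, vocab_size, word_dim=300, embed_dim=EMBED_DIM):
        super().__init__()
        self.video_fc = nn.Linear(video_dim, embed_dim)   # project the pooled video vector
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)

    def forward(self, video_feat, caption_ids):
        # video_feat: (B, video_dim), already average-pooled over space and time
        # caption_ids: (B, T) word indices of the description
        v = F.normalize(self.video_fc(video_feat), dim=-1)
        _, (h, _) = self.lstm(self.word_emb(caption_ids))
        t = F.normalize(h[-1], dim=-1)                    # last hidden state as sentence code
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Hinge-based triplet loss using the other items in the batch as negatives."""
    sim = v @ t.t()                                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                         # similarities of matching pairs
    cost_cap = (margin + sim - pos).clamp(min=0)          # wrong captions for each video
    cost_vid = (margin + sim - pos.t()).clamp(min=0)      # wrong videos for each caption
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_cap.masked_fill(mask, 0).mean() + cost_vid.masked_fill(mask, 0).mean()

def fused_similarity(sim_resnet, sim_c3d):
    """Late fusion at ranking time: average the scores of the two feature types."""
    return 0.5 * (sim_resnet + sim_c3d)
```

The spatial and temporal attention variants keep, respectively, the spatial feature map or the frame sequence instead of a single pooled vector, and learn attention weights over it before computing the similarity.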
Ad-hoc Video Search (AVS): We merged three search systems for AVS. The first is our concept-based, zero-example video search system, which has proven useful in previous years [4]; the second is a video captioning system trained separately for the VTT task; and the third is a text-based search system that computes similarities between the query and videos using the metadata extracted from the videos. In this study, we want to find out whether combining the concept-based system, the captioning system, and the text-based search system helps improve search performance; a minimal sketch of such a score-level fusion is given below. We submit 5 fully automatic runs and 3 manually-assisted runs. Our runs are listed as follows:
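As an illustration of how scores from the three AVS systems above could be combined, the following is a minimal sketch; the min-max normalization and equal default weights are assumptions for illustration, since the fusion formula is not specified in this description.

```python
from typing import Dict

def minmax_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one system's scores to [0, 1] so they are comparable across systems."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {vid: 0.0 for vid in scores}
    return {vid: (s - lo) / (hi - lo) for vid, s in scores.items()}

def fuse_avs_scores(concept: Dict[str, float],
                    caption: Dict[str, float],
                    text: Dict[str, float],
                    weights=(1.0, 1.0, 1.0)):
    """Weighted-sum fusion of concept-based, captioning-based and metadata
    text-search scores; a video missing from a system contributes zero."""
    systems = [minmax_normalize(concept), minmax_normalize(caption), minmax_normalize(text)]
    videos = set().union(*(s.keys() for s in systems))
    fused = {vid: sum(w * s.get(vid, 0.0) for w, s in zip(weights, systems))
             for vid in videos}
    # Ranked list of (video_id, fused_score), best first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example with hypothetical video IDs and scores:
# fuse_avs_scores({"v1": 0.9, "v2": 0.4}, {"v1": 0.2, "v3": 0.8}, {"v2": 0.5})
```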