Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Shanshan Qi, Luxi Yang, Chunguo Li, Yongming Huang
2021 IEEE Access  
Temporal sentence grounding aims to ground a query sentence in a specific segment of a video. Previous methods follow the common equally spaced frame selection mechanism for appearance and motion modeling, which fails to account for redundant and distracting visual information. There is also no guarantee that all meaningful frames are obtained. Moreover, this task needs to detect location clues precisely in both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence is still unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network to progressively localize fine-grained activity segments. First, we present a coarse-grained crucial frame selection module, in which query-guided local difference context modeling over adjacent frames helps discriminate all the coarse boundary locations relevant to the sentence semantics, and the soft assignment vector of locally aggregated descriptors is employed to enhance the representation of the selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries; it disentangles the spatial and temporal semantic information in the query sentence to guide the excavation of visual grounding clues in the corresponding dimensions. Furthermore, we devise a gated graph convolution network that incorporates the spatial-temporal semantic information: a gate operation highlights the frames referred to by the query sentence in the spatial and temporal dimensions, and the fused information is propagated on the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
INDEX TERMS Temporal sentence grounding, coarse-grained crucial frame selection, fine-grained spatial-temporal relationship matching, gated graph convolution network.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Volume 9, 2021
Chunguo Li (Senior Member, IEEE) received the bachelor's degree in wireless communications from Shandong University, in 2005, and the Ph.D. degree in wireless communications from Southeast University, ...
doi:10.1109/access.2021.3095229 fatcat:jow3mzxavfaohemaxgk4d2buey