Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Jing Yuan
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence. Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations to identify the queried target. However, existing approaches cannot distinguish notable objects and remain in ineffective relation modeling between unnecessary objects. Thus, we propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches to develop object-aware region modeling, where each branch focuses on a crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical object relationships between the main branch and auxiliary branches. Moreover, we apply a diversity loss to make each branch pay attention only to its corresponding object and boost multi-branch learning. Extensive experiments show the effectiveness of our proposed method.
doi:10.24963/ijcai.2020/149 dblp:conf/ijcai/ZhangZLHY20 fatcat:4yux7bufpzeqpeexjwfux4tubq