Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Zhou Zhao, Jinghao Lin, Xinghua Jiang, Deng Cai, Xiaofei He, Yueting Zhuang
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
Video question answering is a challenging task in visual information retrieval, which aims to provide an accurate answer from the referenced video content for a given question. However, existing visual question answering approaches mainly tackle static image question answering, and they may be ineffective when applied directly to video question answering because they do not sufficiently model the temporal dynamics of video. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain the object appearance and movement information in the video based on both frame-level and segment-level feature representation methods. We then develop the hierarchical dual-level attention networks to learn the question-aware video representations with word-level and question-level attention mechanisms. We next devise the question-level fusion attention mechanism for our proposed networks to learn the question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets. Extensive experiments validate the effectiveness of our method.
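As a rough illustration of the idea in the abstract, the sketch below shows how a question vector might weight frame-level and segment-level video features and then fuse the two pooled representations. It is not the authors' network: the abstract gives no equations, so the dot-product scoring, the function names (`softmax`, `attend`, `fuse`), and the toy vectors are all assumptions made for illustration, and a real model would use learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, unit_feats):
    """Soft attention: weight each video unit by its similarity to the question.

    Assumption: plain dot-product scoring stands in for whatever learned
    scoring function the paper uses.
    """
    scores = [sum(q * f for q, f in zip(question_vec, feat)) for feat in unit_feats]
    weights = softmax(scores)
    dim = len(unit_feats[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, unit_feats)) for d in range(dim)]
    return pooled, weights


def fuse(question_vec, frame_repr, segment_repr):
    """Hypothetical fusion step: a second attention over the two level outputs."""
    return attend(question_vec, [frame_repr, segment_repr])


# Toy inputs (made up): a 2-d question vector, three frames, two segments.
question = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
segments = [[0.9, 0.1], [0.1, 0.9]]

frame_repr, frame_w = attend(question, frames)
segment_repr, segment_w = attend(question, segments)
joint, level_w = fuse(question, frame_repr, segment_repr)
```

In this toy setup the frame and segment most aligned with the question receive the largest weights, and `joint` is a question-dependent blend of the two levels.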
doi:10.1145/3123266.3123364 dblp:conf/mm/ZhaoLJCHZ17 fatcat:vjag7l4gsbdjzcidiuhchdshsm