Modality Shifting Attention Network for Multi-Modal Video Question Answering
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper considers a network referred to as the Modality Shifting Attention Network (MSAN) for the Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of the temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The modality required for temporal localization may be different from that required for answer prediction, and this ability to shift modality is essential for performing the task. To
doi:10.1109/cvpr42600.2020.01012
dblp:conf/cvpr/KimMPKY20
fatcat:cobov5g4nbdthjj3r4x6i5bp3a