Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang, Yahong Han
2020 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
The dominant video question answering methods are based on fine-grained representation or model-specific attention mechanism. They usually process video and question separately, then feed the representations of different modalities into following late fusion networks. Although these methods use information of one modality to boost the other, they neglect to integrate correlations of both inter- and intra-modality in an uniform module. We propose a deep heterogeneous graph alignment network over
more » ... the video shots and question words. Furthermore, we explore the network architecture from four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct extensive ablation study to the effectiveness of the network architecture. Experiments show the network to be superior in quality.
doi:10.1609/aaai.v34i07.6767 fatcat:kbte5ijo4fh53ngz7bby2uefa4