9,474 Hits in 4.5 sec

Uncovering the Temporal Context for Video Question Answering

Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann
2017 International Journal of Computer Vision  
We present an encoder-decoder approach using Recurrent Neural Networks to learn temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions.  ...  In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future.  ...  We utilize an encoder-decoder model trained in an unsupervised way for visual context learning and propose a dual-channel learning-to-rank method to answer questions.  ... 
doi:10.1007/s11263-017-1033-7 fatcat:5or4ebm2inbc7faqhxzklsvnaq

DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog [article]

Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Bo Xu, Jie Zhou
2019 arXiv   pre-print
In this paper, we thus propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM.  ...  DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning.  ...  Therefore, in this paper, we propose a Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM.  ... 
arXiv:1912.08360v1 fatcat:3zzq7y3u3ncj5kqflpbytlo6gu

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Bo Xu, Jie Zhou
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence  
In this paper, we thus propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM.  ...  DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning.  ...  Therefore, in this paper, we propose a Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM.  ... 
doi:10.1609/aaai.v34i05.6248 fatcat:cjpdcpp3rrdnnawkkqm4cdsgle

From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively target different visual areas that are related to the answer.  ...  Finally, the attended visual features and the question are combined to infer the answer.  ...  Acknowledgments This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2014J063, No.  ... 
doi:10.24963/ijcai.2018/126 dblp:conf/ijcai/SongZGS18 fatcat:qzfcs6si2fdhfdj3qxemyuzkwq

From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [article]

Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen
2022 arXiv   pre-print
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively target different visual areas that are related to the answer.  ...  Finally, the attended visual features and the question are combined to infer the answer.  ...  Acknowledgments This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2014J063, No.  ... 
arXiv:2206.01923v1 fatcat:dwa7hdapxnah7l77nxirfm5coi

Embodied Multimodal Multitask Learning [article]

Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra
2019 arXiv   pre-print
Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for different multimodal tasks, such as semantic goal navigation and embodied question answering.  ...  Visual Question Answering and Zhao et al. (2018) for grounding audio to vision.  ... 
arXiv:1902.01385v1 fatcat:am4tay55ibhgtlgd3osvdhpkci

Dual-Channel Reasoning Model for Complex Question Answering

Xing Cao, Yun Liu, Bo Hu, Yu Zhang, Xuzhen Zhu
2021 Complexity  
In this paper, we propose a dual-channel reasoning architecture, where two reasoning channels predict the final answer and supporting facts' sentences, respectively, while sharing the contextual embedding  ...  The two reasoning channels can simply use the same reasoning structure without additional network designs.  ...  In this paper, a dual-channel reasoning architecture is designed for complex question answering.  ... 
doi:10.1155/2021/7367181 fatcat:pk33ybw7ufcarp4dww355mzyuq

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [article]

Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu
2022 arXiv   pre-print
cross-modal semantic interaction.  ...  Consequently, Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or  ...  ACKNOWLEDGMENTS This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205), and the National Natural Science Foundation of China (No. 62032020, 61960206008  ... 
arXiv:2207.00782v1 fatcat:a57laj75xfa43gg4hjvxdh4c4i

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [article]

Yang Liu, Guanbin Li, Liang Lin
2022 arXiv   pre-print
Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust causality-aware visual-linguistic question answering  ...  Existing visual question answering methods tend to capture the spurious correlations from visual and linguistic modalities, and fail to discover the true causal mechanism that facilitates reasoning truthfully  ...  The only existing work for event-level urban visual question answering is Eclipse [24] , which built an event-level urban traffic visual question answering dataset and proposed an efficient glimpse network  ... 
arXiv:2207.12647v1 fatcat:quvke2bkjzce5j5rrymodzrgui

Triple Attention Network architecture for MovieQA [article]

Ankit Shah, Tzu-Hsiang Lin, Shijie Wu
2021 arXiv   pre-print
Movie question answering, or MovieQA, is a multimedia-related task wherein one is provided with a video, the subtitle information, a question and candidate answers for it.  ...  The task is to predict the correct answer for the question using the components of the multimedia - namely video/images, audio and text.  ...  Dual Attention Networks Dual Attention Network (DAN) was proposed for Visual Question Answering [3] and Image Text Matching.  ... 
arXiv:2111.09531v1 fatcat:zq7qwr6srveelpoljqzlihobei

Multi-modal Memory Enhancement Attention Network for Image-Text Matching

Zhong Ji, Zhigang Lin, Haoran Wang, Yuqing He
2020 IEEE Access  
from both perspectives of fragment and channel.  ...  The key element to narrow the "heterogeneity gap" between visual and textual data lies in how to learn powerful and robust representations for both modalities.  ...  Three trucks towing travel trailers with ATV's in the truck bed. visual question answering task.  ... 
doi:10.1109/access.2020.2975594 fatcat:ciiubythzzevpkw2ip5csnjwf4

DualAttn-GAN: Text to Image Synthesis with Dual Attentional Generative Adversarial Network

Yali Cai, Xiaoru Wang, Zhihong Yu, Fu Li, Peirong Xu, Yueli Li, Lixian Li
2019 IEEE Access  
It is due to the ineffectiveness of convolutional neural networks in capturing the high-level semantic information for pixel-level image synthesis.  ...  On the other hand, the visual attention module models internal representations of vision from channel and spatial axes, which can better capture the global structures.  ...  question answering [39]–[41].  ... 
doi:10.1109/access.2019.2958864 fatcat:gj72fnk6yfhmbnld2eys7jkjsy

Temporally Multi-Modal Semantic Reasoning with Spatial Language Constraints for Video Question Answering

Mingyang Liu, Ruomei Wang, Fan Zhou, Ge Lin
2022 Symmetry  
Video question answering (QA) aims to understand the video scene and underlying plot by answering video questions.  ...  Specifically, for a question, the result processed by the spatial language constraints module is to obtain visual clues related to the question from a single image and filter out unwanted spatial information  ...  networks to obtain visual semantic clues.  ... 
doi:10.3390/sym14061133 fatcat:jnjqzzpst5abtjduofkpzpyctu

R-VQA

Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, Jianyong Wang
2018 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD '18  
To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA.  ...  Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning as it requires understanding both visual and textual modalities.  ...  ACKNOWLEDGMENTS We would like to thank our anonymous reviewers for their constructive feedback and suggestions. This work was supported in part by the National Natural Science  ... 
doi:10.1145/3219819.3220036 dblp:conf/kdd/LuJZDZW18 fatcat:jnqklx52mrgobegy5h7nrkbipq

VL-BEiT: Generative Vision-Language Pretraining [article]

Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei
2022 arXiv   pre-print
Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval.  ...  Moreover, our method learns transferable visual features, achieving competitive performance on image classification, and semantic segmentation.  ...  Acknowledgement We would like to acknowledge Zhiliang Peng for the helpful discussions.  ... 
arXiv:2206.01127v1 fatcat:zcrtq6kh3zeopklymmytarqzka
Showing results 1 — 15 out of 9,474 results