Global Fusion Attention for Vision and Language Understanding (Student Abstract)
2021
AAAI Conference on Artificial Intelligence
We extend the popular Transformer architecture to a multimodal model that processes both visual and textual inputs. We propose a new attention mechanism on a Transformer-based architecture for joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which better captures their internal relationships. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also
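As a rough illustration of the kind of mechanism the abstract describes (cross-modal attention computed at several comprehension levels and combined with learned weights), here is a minimal sketch. This is not the authors' code: the module name, dimensions, number of levels, and the softmax-weighted fusion are all assumptions made for illustration only.

```python
# Hypothetical sketch of weighted multi-level cross-modal fusion attention.
# Everything below is an illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedCrossModalFusion(nn.Module):
    """Fuse text-conditioned views of image features from multiple levels."""

    def __init__(self, dim: int = 256, num_levels: int = 3, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per comprehension level (assumption).
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )
        # Learned scalar weights over levels, normalised with a softmax.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_img, dim), text_feats: (B, N_txt, dim)
        level_outputs = []
        for attn in self.cross_attn:
            # Image regions attend to text tokens (query=image, key/value=text).
            fused, _ = attn(image_feats, text_feats, text_feats)
            level_outputs.append(fused)
        weights = F.softmax(self.level_logits, dim=0)       # (num_levels,)
        stacked = torch.stack(level_outputs, dim=0)         # (L, B, N_img, dim)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)


if __name__ == "__main__":
    model = WeightedCrossModalFusion()
    img = torch.randn(2, 49, 256)   # e.g. 7x7 grid of image region features
    txt = torch.randn(2, 12, 256)   # e.g. 12 question tokens (VQA setting)
    out = model(img, txt)
    print(out.shape)                # torch.Size([2, 49, 256])
```

The learned per-level weights are one simple way to realise "fusing multi-level comprehension in a weighted manner"; the actual paper may weight levels per token or per head instead.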
dblp:conf/aaai/GuoLWB21
fatcat:j3s4upai7ba2vngigdcncunrkm