RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer
2021
Mobile Information Systems
Visual question answering (VQA) is natural-language question answering over visual images. A VQA model must produce an answer to a specific question based on its understanding of the image, and the most important requirement is understanding the relationship between the image and the language. Therefore, this paper proposes a new model, the Representation of Dense Multimodality Fusion Encoder Based on Transformer (RDMMFET for short), which can learn the related knowledge between vision and language.
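The core mechanism behind transformer-based vision-language fusion encoders like the one the abstract describes is cross-modal attention, where question tokens attend to image region features. The paper's actual architecture is not detailed here, so the following is only a minimal illustrative sketch of one cross-attention step, with hypothetical random projection weights standing in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, d_k=8, seed=0):
    """One cross-modal attention step: question tokens attend to image regions.

    This is a generic sketch, not RDMMFET's actual encoder; the projection
    matrices below are random placeholders for weights a model would learn.
    """
    rng = np.random.default_rng(seed)
    d = text_feats.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)  # query projection (hypothetical)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)  # key projection (hypothetical)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)  # value projection (hypothetical)
    Q = text_feats @ Wq    # queries from question tokens
    K = image_feats @ Wk   # keys from image region features
    V = image_feats @ Wv   # values from image region features
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # text-to-image attention weights
    return attn @ V        # fused representation, one vector per question token

# Toy inputs: 5 question tokens and 10 image regions, each 16-dimensional.
fused = cross_attention(np.ones((5, 16)), np.ones((10, 16)))
print(fused.shape)  # (5, 8)
```

Stacking such cross-attention layers (in both text-to-image and image-to-text directions) is the usual way a transformer fusion encoder lets the two modalities exchange information before answer prediction.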
doi:10.1155/2021/2662064
fatcat:5qxdumw5hzhutb4txpzuy4rqlu