
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [article]

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach
2020 arXiv   pre-print
Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.  ...  In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images.  ...  [44] ) on the TextVQA task, our model, accompanied by rich features for image text, handles all modalities with a multimodal transformer over a joint embedding space instead of pairwise fusion mechanisms  ... 
arXiv:1911.06258v3 fatcat:c4zkfcdk2bgkdiljbeig6dcedq
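The iterative decoding described above maps to a compact scoring head. Below is a minimal PyTorch sketch of one dynamic-pointer step, assuming a hidden size H, a fixed answer vocabulary, and N encoded OCR tokens; the class and layer names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class DynamicPointerHead(nn.Module):
    """One decoding step: score a fixed vocabulary and, via a bilinear
    pointer, the OCR tokens detected in the image, so the predicted
    answer can mix common words with copied scene text."""

    def __init__(self, hidden: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden, vocab_size)  # fixed-vocab logits
        self.query_proj = nn.Linear(hidden, hidden)      # pointer query
        self.key_proj = nn.Linear(hidden, hidden)        # pointer keys (OCR)

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (B, H) decoder output at the current step
        # ocr_feats: (B, N, H) encoded OCR tokens
        vocab_logits = self.vocab_proj(dec_state)                    # (B, V)
        query = self.query_proj(dec_state).unsqueeze(1)              # (B, 1, H)
        ptr_logits = (query * self.key_proj(ocr_feats)).sum(dim=-1)  # (B, N)
        # Concatenate: argmax over (V + N) picks a vocab word or an OCR token.
        return torch.cat([vocab_logits, ptr_logits], dim=-1)
```

At inference the model runs several such steps, feeding the embedding of each selected token back into the decoder, which is what makes the prediction multi-step rather than one-shot classification.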

Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering [article]

Chengyang Fang, Gangyan Zeng, Yu Zhou, Daiqing Wu, Can Ma, Dayong Hu, Weiping Wang
2022 arXiv   pre-print
Extensive experiments on TextVQA and ST-VQA datasets show the effectiveness of our model. SC-Net surpasses previous works by a noticeable margin and is more reasonable for the TextVQA task.  ...  In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT).  ...  To solve this problem, we propose an SCT which focuses on the semantic information in the multimodal fusion stage.  ... 
arXiv:2203.12929v1 fatcat:kxtbpm4jazhjdeasa5e3r624te
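The record above names an instance-level contrastive semantic prediction module (ICSP) without detail in the snippet. As one plausible reading only, an InfoNCE-style objective that pulls each OCR instance's visual feature toward its own semantic (text) embedding could look like this; the function name and temperature are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(visual: torch.Tensor,
                              semantic: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over instances: visual[i] should match semantic[i]
    against every other semantic vector in the batch.
    visual, semantic: (N, D) paired features."""
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / temperature                    # (N, N) similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```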

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering [article]

Wei Han, Hantao Huang, Tao Han
2020 arXiv   pre-print
Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task.  ...  As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge.  ...  Besides, we propose a multimodal fusion module with context-enriched OCR representation, which uses a novel position-guided attention to integrate context object features into OCR representation.  ... 
arXiv:2010.02582v1 fatcat:uet3xdoftfetjficvjki2ixkdi
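The position-guided attention mentioned in this snippet, which integrates context object features into the OCR representation, can be sketched as cross-attention plus a learned geometric bias. The module below is a sketch under that assumption; the 4-d relative-box encoding and all names are illustrative, not LaAP-Net's implementation.

```python
import torch
import torch.nn as nn

class PositionGuidedAttention(nn.Module):
    """Enrich each OCR token with context object features, weighting
    objects by feature similarity plus relative box geometry."""

    def __init__(self, hidden: int, pos_dim: int = 4):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.pos_bias = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def forward(self, ocr: torch.Tensor, obj: torch.Tensor,
                rel_pos: torch.Tensor) -> torch.Tensor:
        # ocr: (B, N, H) OCR features; obj: (B, M, H) object features
        # rel_pos: (B, N, M, 4) relative box offsets between OCR and objects
        logits = self.q(ocr) @ self.k(obj).transpose(-1, -2)      # (B, N, M)
        logits = logits / ocr.size(-1) ** 0.5 + self.pos_bias(rel_pos).squeeze(-1)
        attn = logits.softmax(dim=-1)
        return ocr + attn @ obj   # context-enriched OCR representation
```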

Structured Multimodal Attentions for TextVQA [article]

Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, Qi Wu
2021 arXiv   pre-print
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason  ...  In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.  ...  In this paper, we propose a structured multimodal attention (SMA) neural network to solve the above issues.  ... 
arXiv:2006.00753v2 fatcat:4lk4yloglnhdxftkzrrlcs3ztq
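A graph attention layer that distinguishes object-object, object-text and text-text edges, as the abstract describes, can be approximated with per-edge-type key/value projections. This is a generic sketch of that idea, not SMA's actual network; the masking scheme and all names are assumptions.

```python
import torch
import torch.nn as nn

class TypedGraphAttention(nn.Module):
    """One attention round over a graph of object and OCR-token nodes,
    with a separate key/value projection per edge type
    (e.g. 0: obj-obj, 1: obj-text, 2: text-text)."""

    def __init__(self, hidden: int, num_edge_types: int = 3):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_edge_types))
        self.v = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_edge_types))

    def forward(self, nodes: torch.Tensor, edge_type: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, H); edge_type: (B, N, N) integer edge-type ids
        q = self.q(nodes)
        scale = nodes.size(-1) ** 0.5
        logits = nodes.new_zeros(edge_type.shape, dtype=nodes.dtype)
        for t, k_t in enumerate(self.k):
            mask = (edge_type == t).to(nodes.dtype)
            logits = logits + mask * (q @ k_t(nodes).transpose(-1, -2)) / scale
        attn = logits.softmax(dim=-1)                  # (B, N, N)
        out = torch.zeros_like(nodes)
        for t, v_t in enumerate(self.v):
            mask = (edge_type == t).to(nodes.dtype)
            out = out + (attn * mask) @ v_t(nodes)     # type-specific messages
        return nodes + out
```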

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps [article]

Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu
2020 arXiv   pre-print
In this paper, we argue that a simple attention mechanism can do the same or even better job without any bells and whistles.  ...  Two tasks -- text-based visual question answering and text-based image captioning, with a text extension from existing vision-language applications, are catching on rapidly.  ...  We use the same model on TextVQA and the three tasks of ST-VQA, only with different answer vocabularies, each with a fixed size of 5,000.  ... 
arXiv:2012.05153v1 fatcat:2nmziuo7cvbizcrn42wjn5szsq
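The "simple attention" claim in this entry suggests little beyond question-conditioned pooling. A minimal sketch of such pooling, assumed rather than taken from the paper: pool objects and OCR tokens separately with the module below, then classify over the fixed 5,000-word answer vocabulary mentioned in the snippet (the OCR copy branch is omitted here).

```python
import torch
import torch.nn as nn

class QuestionGuidedPooling(nn.Module):
    """Pool a set of region features into one vector, weighted by their
    dot-product relevance to the question embedding."""

    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, question: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # question: (B, H); regions: (B, N, H) object or OCR features
        scores = (self.proj(regions) * question.unsqueeze(1)).sum(-1)  # (B, N)
        attn = scores.softmax(dim=-1).unsqueeze(-1)                    # (B, N, 1)
        return (attn * regions).sum(dim=1)                             # (B, H)
```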

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [article]

Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
2020 arXiv   pre-print
Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).  ...  Answering questions that require reading texts in an image is challenging for current models.  ...  These works [45, 36, 32, 21, 47] propose GNNs with a language-conditioned aggregator to dynamically locate a subgraph of the scene for a given query (e.g. a referring expression or a question), then GNN  ... 
arXiv:2003.13962v1 fatcat:ifofez5zjjdlrf2i7iptppe4oe
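The language-conditioned aggregator quoted above, which locates a query-relevant subgraph, can be sketched as attention whose queries are gated by the question embedding. Everything below is an illustrative approximation; MM-GNN's actual aggregators (visual, semantic, numeric) are richer than this.

```python
import torch
import torch.nn as nn

class LanguageConditionedAggregator(nn.Module):
    """One message-passing round where attention over neighbours is
    conditioned on the question, so aggregation follows the query."""

    def __init__(self, hidden: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)  # fuse question into queries
        self.msg = nn.Linear(hidden, hidden)

    def forward(self, nodes: torch.Tensor, question: torch.Tensor,
                adj: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, H); question: (B, H)
        # adj: (B, N, N) 0/1 edges; include self-loops so every row attends somewhere
        q = torch.tanh(self.gate(torch.cat(
            [nodes, question.unsqueeze(1).expand_as(nodes)], dim=-1)))  # (B, N, H)
        logits = q @ nodes.transpose(-1, -2) / nodes.size(-1) ** 0.5    # (B, N, N)
        logits = logits.masked_fill(adj == 0, float('-inf'))
        attn = logits.softmax(dim=-1)
        return nodes + attn @ self.msg(nodes)  # updated node features
```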

Multimodal grid features and cell pointers for Scene Text Visual Question Answering [article]

Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas
2020 arXiv   pre-print
This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present  ...  The output weights of this attention module over the grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the  ...  A potential issue with this dual attention is that it makes it difficult for the model to reason jointly about the two modalities, since this can only be done after the late fusion of the outputs of the two  ... 
arXiv:2006.00923v2 fatcat:px6mb3b34nc6diwdfssyv6kdoi
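The core mechanism described here, attention weights over a multimodal grid read directly as answer-location probabilities, is easy to pin down. A minimal sketch, assuming an h x w grid of already-fused features and a pooled question vector; the names are illustrative.

```python
import torch
import torch.nn as nn

class CellPointer(nn.Module):
    """Attention over an h x w grid of fused image/text features whose
    weights are read as the probability that each cell holds the answer."""

    def __init__(self, hidden: int):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)

    def forward(self, question: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # question: (B, H); grid: (B, h, w, H) multimodal spatial features
        B, h, w, H = grid.shape
        cells = grid.reshape(B, h * w, H)
        logits = (cells * self.q(question).unsqueeze(1)).sum(-1)  # (B, h*w)
        probs = logits.softmax(dim=-1)
        return probs.view(B, h, w)  # per-cell answer-location probability
```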

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [article]

Xiaopeng Lu, Zhen Fan, Yansen Wang, Jean Oh, Carolyn P. Rose
2021 arXiv   pre-print
In this paper, we propose Localize, Group, and Select (LOGOS), a novel model which attempts to tackle this problem from multiple aspects.  ...  As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.  ...  Also, it attaches a pointer network that can dynamically copy words from OCR systems.  ... 
arXiv:2108.08965v1 fatcat:duecadpwzfg3tiienqhiucszya

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Wei Han, Hantao Huang, Tao Han
2020 Proceedings of the 28th International Conference on Computational Linguistics   unpublished
Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task.  ...  As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge.  ...  Besides, we propose a multimodal fusion module with context-enriched OCR representation, which uses a novel position-guided attention to integrate context object features into OCR representation.  ... 
doi:10.18653/v1/2020.coling-main.278 fatcat:237oup234jgpjhm75xhja4xgu4

ICDAR 2021 Competition on Document Visual Question Answering [article]

Rubèn Tito, Minesh Mathew, C.V. Jawahar, Ernest Valveny, Dimosthenis Karatzas
2021 arXiv   pre-print
This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced task on Infographics VQA.  ...  We present a summary of the datasets used for each task, a description of each of the submitted methods, and the results and analysis of their performance.  ...  Manmatha for many useful inputs and discussions.  ... 
arXiv:2111.05547v1 fatcat:ioftl7jhffhaxhf7bsvqftmqxm

Towards Accurate Text-based Image Captioning with Content Diversity Exploration [article]

Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo, Qing Du, Qi Wu
2021 arXiv   pre-print
Text-based image captioning (TextCap), which aims to read and reason about images with texts, is crucial for a machine to understand a detailed and complex scene environment, considering that texts are omnipresent  ...  To conquer these, we propose a novel Anchor-Captioner method. Specifically, we first find the important tokens which are supposed to be paid more attention to and consider them as anchors.  ...  In Eq. (8), f_dp denotes the dynamic pointer network [19] that makes predictions based on G and y_c.  ... 
arXiv:2105.03236v1 fatcat:37pqn7apyndxpb6bt7wl6j3fqm
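The anchor step quoted above ("find the important tokens ... and consider them as anchors") reduces to scoring OCR tokens and keeping the top-k; the subsequent f_dp is the same copy mechanism as the pointer head sketched under the first entry in this list. A hedged sketch of the anchor selection, with the scorer and k as assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn

class AnchorSelector(nn.Module):
    """Score OCR tokens and keep the top-k as anchors around which
    captions are generated."""

    def __init__(self, hidden: int, k: int = 3):
        super().__init__()
        self.score = nn.Linear(hidden, 1)
        self.k = k

    def forward(self, ocr_feats: torch.Tensor):
        # ocr_feats: (B, N, H) encoded OCR tokens
        scores = self.score(ocr_feats).squeeze(-1)  # (B, N) token importance
        topk = scores.topk(self.k, dim=-1)
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, ocr_feats.size(-1))
        anchors = ocr_feats.gather(1, idx)          # (B, k, H) anchor features
        return anchors, topk.indices
```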