A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is application/pdf
.
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
[article]
2020
arXiv
pre-print
Texts appearing in daily scenes that can be recognized by OCR (Optical Character Recognition) tools contain significant information, such as street name, product brand and prices. Two tasks -- text-based visual question answering and text-based image captioning, with a text extension from existing vision-language applications, are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structure) are being used. In this
arXiv:2012.05153v1
fatcat:2nmziuo7cvbizcrn42wjn5szsq