
TransVG: End-to-End Visual Grounding with Transformers [article]

Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
2022 arXiv   pre-print
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image  ...  We build the benchmark of the transformer-based visual grounding framework and make the code available at .  ...  The whole architecture of our TransVG is optimized end-to-end with the AdamW optimizer.  ...
arXiv:2104.08541v4 fatcat:uatyv5tmn5hgnibffge76beu3y
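
The snippet above notes only that TransVG is optimized end-to-end with AdamW. As a minimal sketch of what that means in practice (the toy model, learning rate, and L1 loss below are illustrative assumptions, not TransVG's reported architecture or settings):

```python
import torch
import torch.nn as nn

class ToyGroundingModel(nn.Module):
    """Placeholder: maps an image and a tokenized query to one normalized box."""
    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        self.visual = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # crude patch embedding
        self.text = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, 4)

    def forward(self, image, tokens):
        v = self.visual(image).flatten(2).mean(-1)  # (B, dim) pooled visual feature
        t = self.text(tokens).mean(1)               # (B, dim) pooled text feature
        return self.head(v + t).sigmoid()           # (B, 4) box in [0, 1]

model = ToyGroundingModel()
# One AdamW optimizer over all parameters: the whole pipeline trains end-to-end.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

image = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 30522, (2, 20))
target = torch.rand(2, 4)                           # dummy ground-truth boxes

loss = nn.functional.l1_loss(model(image, tokens), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```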

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [article]

Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu
2022 arXiv   pre-print
They base the visual grounding on the features from pre-generated proposals or anchors, and fuse these features with the text embeddings to locate the target mentioned by the text.  ...  In this paper, we propose a transformer-based framework for accurate visual grounding by establishing text-conditioned discriminative features and performing multi-stage cross-modal reasoning.  ...  Motivated by that, TransVG [5] proposes a transformer-based framework for visual grounding.  ... 
arXiv:2205.00272v1 fatcat:n4aqffqmprgxffcovw6vp6swje
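
The abstract above contrasts fusing pre-generated proposal features with text embeddings against text-conditioned discriminative features. A hedged sketch of the underlying primitive, cross-attention that conditions visual tokens on a language query, using only stock PyTorch modules (dimensions and layout are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

dim = 256
# Standard multi-head attention; with batch_first=True inputs are (B, T, dim).
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

visual = torch.randn(2, 196, dim)  # flattened visual tokens (B, HW, dim)
text = torch.randn(2, 20, dim)     # text token embeddings (B, L, dim)

# Visual tokens act as queries; the text supplies keys and values, so each
# visual token is re-weighted by its relevance to the language query.
conditioned, attn = cross_attn(query=visual, key=text, value=text)
print(conditioned.shape)           # torch.Size([2, 196, 256])
```

Stacking such a layer several times is one way to realize the multi-stage cross-modal reasoning the abstract describes.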

Referring Expression Comprehension via Cross-Level Multi-Modal Fusion [article]

Peihan Miao, Wei Su, Lian Wang, Yongjian Fu, Xi Li
2022 arXiv   pre-print
To this end, we design a Cross-level Multi-modal Fusion (CMF) framework, which gradually integrates multi-layer visual and textual features through intra- and inter-modal fusion.  ...  Considering that REC requires visual and textual hierarchical information for accurate target localization, and that encoders inherently extract features in a hierarchical fashion, we propose to effectively  ...  We set N learnable parameters and optimize them end-to-end with the framework.  ...
arXiv:2204.09957v1 fatcat:ypnf7lj6cfa5hlvlhmrgn4ut2m
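
The CMF snippet mentions N learnable parameters optimized end-to-end to integrate multi-layer features. A toy sketch of one such scheme, a learnable softmax weight per feature level (an assumed stand-in; the actual CMF fusion design is not specified in the snippet):

```python
import torch
import torch.nn as nn

class CrossLevelFusion(nn.Module):
    """Fuse N feature levels with N learnable scalar weights."""
    def __init__(self, num_levels, dim):
        super().__init__()
        self.level_weights = nn.Parameter(torch.zeros(num_levels))  # the N parameters
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):              # feats: list of N tensors, each (B, T, dim)
        w = self.level_weights.softmax(0)  # normalize so the level weights sum to 1
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.proj(fused)

fusion = CrossLevelFusion(num_levels=4, dim=256)
feats = [torch.randn(2, 196, 256) for _ in range(4)]
out = fusion(feats)                        # (2, 196, 256); weights train with the model
```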

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [article]

Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang
2022 arXiv   pre-print
Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding.  ...  Then, we design a task-related query prompt module to specifically tailor the generated pseudo language queries to visual grounding tasks.  ...  All our experiments are conducted with the PyTorch framework [43] on 8 RTX 3090 GPUs. Our visual-language model is optimized end-to-end with AdamW.  ...
arXiv:2203.08481v2 fatcat:rqbg4kktjvelzpuxqzwcqut4iq
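
Pseudo-Q's central idea is generating pseudo language queries from unlabeled images, then tailoring them with a query prompt module. A minimal template-based sketch (the templates and attribute strings are invented for illustration; the paper's module is learned and considerably richer):

```python
import random

def pseudo_query(obj_name, attribute=None, relation=None):
    """Compose a pseudo referring expression from detected object properties."""
    phrase = " ".join(p for p in (attribute, obj_name, relation) if p)
    # Hypothetical task-related prompt templates wrapping the raw phrase.
    templates = ["find the {}", "which region does the text '{}' describe?"]
    return random.choice(templates).format(phrase)

print(pseudo_query("dog", attribute="brown", relation="on the left"))
# e.g. "find the brown dog on the left"
```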

SeqTR: A Simple yet Universal Network for Visual Grounding [article]

Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji
2022 arXiv   pre-print
The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks.  ...  To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence  ...  For transformer-based models, SeqTR surpasses TransVG [7] and TRAR [65] with up to 6.27% absolute performance improvement.  ... 
arXiv:2203.16265v1 fatcat:24acknsvdfhvxbcmcvnoatlfjy
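
SeqTR casts grounding as point prediction, serializing a box (or mask contour) as a sequence of discrete coordinate tokens. A hedged sketch of that discretization step (the bin count and the (x1, y1, x2, y2) layout are assumptions for illustration):

```python
import torch

def box_to_tokens(box, num_bins=1000):
    """Quantize a normalized (x1, y1, x2, y2) box into integer coordinate tokens."""
    return (box.clamp(0, 1) * (num_bins - 1)).round().long()

def tokens_to_box(tokens, num_bins=1000):
    """Map predicted coordinate tokens back to continuous coordinates."""
    return tokens.float() / (num_bins - 1)

box = torch.tensor([0.12, 0.30, 0.58, 0.91])
tokens = box_to_tokens(box)        # tensor([120, 300, 579, 909])
recovered = tokens_to_box(tokens)  # close to the original box
```

With coordinates expressed as tokens, one sequence decoder can emit boxes and masks alike, which is what lets a single network cover multiple grounding tasks.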

FindIt: Generalized Localization with Natural Language Queries [article]

Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
2022 arXiv   pre-print
Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization, or detection queries for zero, one, or multiple objects.  ...  We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection  ...  In short, FindIt is a simple, efficient, and end-to-end trainable model for unified visual grounding and object detection.  ...
arXiv:2203.17273v1 fatcat:5pmmjbc3n5d5lg7od7ec6gvlpy
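
FindIt's claim is that one model answers referring-expression, text-localization, and detection queries, returning zero, one, or many boxes. A hypothetical unified interface capturing that contract (names and signature are illustrative only, not FindIt's API):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LocalizationResult:
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

def query_model(image, text: str) -> LocalizationResult:
    """Hypothetical unified entry point: the same call would serve REC
    ("the man in red"), text-based localization ("person"), and
    detection-style queries ("detect all objects")."""
    return LocalizationResult()  # stub; a trained unified model would fill this in
```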

What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study [article]

Gen Luo, Yiyi Zhou, Jiamu Sun, Shubin Huang, Xiaoshuai Sun, Qixiang Ye, Yongjian Wu, Rongrong Ji
2022 arXiv   pre-print
...  that run counter to conventional understanding.  ...  To fill this gap, we conduct an empirical study in this paper.  ...  Compared with the traditional object detection task [13], [14], REC is not limited to a fixed set of object categories and can theoretically perform any open-ended detection according to the text description  ...
arXiv:2204.07913v1 fatcat:nbwvh7j4b5ga5etbhed4klj65q

A Survey of Visual Transformers [article]

Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He
2022 arXiv   pre-print
...  to bridge the gap between the visual Transformers and the sequential ones.  ...  Because of their competitive modeling capabilities, the visual Transformers have achieved impressive performance improvements over multiple benchmarks as compared with modern Convolutional Neural Networks  ...  Fig. 8: Illustration of the panoptic head.  ...
arXiv:2111.06091v3 fatcat:a3fq6lvvzzgglb3qtus5qwrwpe

Transformers in Vision: A Survey [article]

Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
2021 arXiv   pre-print
...  multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution  ...  We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding.  ...  We would also like to thank Mohamed Afham for his help with a figure.  ...
arXiv:2101.01169v4 fatcat:ynsnfuuaize37jlvhsdki54cy4
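
Both surveys trace the success of visual Transformers to self-attention. For concreteness, single-head scaled dot-product self-attention written directly in PyTorch (a textbook formulation, not any specific surveyed model):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (B, T, d) token sequence; w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T) similarities
    return scores.softmax(dim=-1) @ v                         # (B, T, d) weighted values

B, T, d = 2, 196, 256
x = torch.randn(B, T, d)
w_q, w_k, w_v = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 196, 256)
```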