A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [article]

Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
2021 arXiv   pre-print
This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks.  ...  performance on COCO object detection and instance segmentation tasks.  ...  Conclusion In this work, we present a simple, single-scale vision transformer backbone that can serve as a strong baseline for object detection and instance segmentation.  ... 
arXiv:2112.09747v1 fatcat:wyixjj5rzrh6zb2shr3v64karq

Bottleneck Transformers for Visual Recognition [article]

Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
2021 arXiv   pre-print
We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance  ...  We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision  ...  Houlsby, Alexey Dosovitskiy for feedback.  ... 
arXiv:2101.11605v2 fatcat:n4d4lfqg5ze4hgq7q7oe5zblna

MPViT: Multi-Path Vision Transformer for Dense Prediction [article]

Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang
2021 arXiv   pre-print
classification, object detection, instance segmentation, and semantic segmentation.  ...  Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches.  ...  Object Detection and Instance Segmentation Setting.  ... 
arXiv:2112.11010v2 fatcat:mk6uiqrinbborohzy7s2jj7cri

Content-Aware Multi-Level Guidance for Interactive Instance Segmentation

Soumajit Majumder, Angela Yao
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
Guidance maps used in current systems are purely distance-based and are either too localized or non-informative.  ...  In interactive instance segmentation, users give feedback to iteratively refine segmentation masks.  ...  Gaussian-and Euclidean distance maps are primarily used for localizing the user clicks and do not account for the object scale.  ... 
doi:10.1109/cvpr.2019.01187 dblp:conf/cvpr/MajumderY19 fatcat:pyogazzvkrb4piyywufb4l2rdm

DeepLab2: A TensorFlow Library for Deep Labeling [article]

Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille (+3 others)
2021 arXiv   pre-print
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.  ...  To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0 single-scale inference and ImageNet-1K pretrained checkpoints.  ...  We would like to thank Michalis Raptis for the feedback on the paper, Jiquan Ngiam and Amil Merchant for Hungarian Matching implementation, and the support from Google Mobile Vision.  ... 
arXiv:2106.09748v1 fatcat:lbly3ld4zjc5rpq6l7l6bzkko4

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [article]

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
2021 arXiv   pre-print
, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation.  ...  Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without  ...  Here, we compare the performance of our PVT (pure Transformer) and GCNet (CNN w/ non-local), using Mask R-CNN for instance segmentation.  ... 
arXiv:2102.12122v2 fatcat:45k4yxcah5bflpff43phhgxr3i

What Makes for Hierarchical Vision Transformer? [article]

Yuxin Fang, Xinggang Wang, Rui Wu, Wenyu Liu
2021 arXiv   pre-print
tasks such as object detection and instance segmentation.  ...  In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication  ...  To more efficiently apply Vision Transformers to other downstream tasks in computer vision such as object detection, instance segmentation, and scene parsing, three key issues need to be solved: (1) involving  ... 
arXiv:2107.02174v2 fatcat:e5qv4mzap5bpxnfnqylvorzz54

3rd Place Scheme on Instance Segmentation Track of ICCV 2021 VIPriors Challenges [article]

Pengyu Chen, Wanhua Li
2021 arXiv   pre-print
We only use a single GPU during the whole training and testing stages.  ...  In this paper, we introduce a data-efficient instance segmentation method we used in the 2021 VIPriors Instance Segmentation Challenge.  ...  The dense local regression and a discriminative RoI pooling scheme are introduced to achieve high-quality object detection and instance segmentation.  ... 
arXiv:2110.00242v3 fatcat:6jhogrdz65g53hz43jyc4ykbem

Transformers in Vision: A Survey [article]

Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
2021 arXiv   pre-print
We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling  ...  These strengths have led to exciting progress on a number of vision tasks using Transformer networks.  ...  We would also like to thank Mohamed Afham for his help with a figure.  ... 
arXiv:2101.01169v4 fatcat:ynsnfuuaize37jlvhsdki54cy4

Container: Context Aggregation Network [article]

Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
2021 arXiv   pre-print
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container, can be employed in object detection and instance  ...  Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.  ...  In computer vision, Non-local Neural Network [61] has been proposed to capture long range interactions to compensate for the local information captured by CNNs and used for object detection [27] and  ... 
arXiv:2106.01401v2 fatcat:qxj3bdes2jh4dnibkdzamtw2pq

Segmenter: Transformer for Semantic Segmentation [article]

Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
2021 arXiv   pre-print
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.  ...  In this paper we introduce Segmenter, a transformer model for semantic segmentation.  ...  Finally, our largest Segmenter model, Seg-L/16, achieves a strong mIoU of 50.71% with a simple decoding scheme on the ADE20K validation dataset with single scale inference.  ... 
arXiv:2105.05633v3 fatcat:si63vqmphvdktblgs6dnqnkweu

Multiscale object representation using surface patches

Wassim Alami, Gregory Dudek, David P. Casasent
1994 Intelligent Robots and Computer Vision XIII: Algorithms and Computer Vision  
The extraction of simple uniform curvature features is limited by the fact that the optimal scale of processing for a single object is very difficult to determine.  ...  As a solution we propose the segmentation of range data into patches at multiple scales.  ...  ACKNOWLEDGEMENTS The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council and the Canadian Federal Centres of Excellence Program.  ... 
doi:10.1117/12.188884 fatcat:6hmtbrqpjzbopkzlok7ac3ttge

Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work [article]

Khawar Islam
2022 arXiv   pre-print
Vision Transformers (ViTs) are becoming more popular and dominating technique for various vision tasks, compare to Convolutional Neural Networks (CNNs).  ...  As a demanding technique in computer vision, ViTs have been successfully solved various vision problems while focusing on long-range relationships.  ...  [15] 3D Object Detection Point clouds backbone and local and global context [16] Medical Image Segmentation Convolution for features and long-range association [17] Object Goal Navigation Object instances  ... 
arXiv:2203.01536v2 fatcat:m26h7ll4xzeylmdoezftbel52m

SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [article]

Zekun Li, Yufan Liu, Bing Li, Weiming Hu, Kebin Wu, Pei Wang
2021 arXiv   pre-print
Although transformer has achieved great progress on computer vision tasks, the scale variation in dense image prediction is still the key challenge.  ...  Inspired by these findings, we propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled  ...  object detection, semantic segmentation and instance segmentation.  ... 
arXiv:2109.08963v1 fatcat:a4yzyzcfyjhbpby6rgkvntr5ny

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [article]

Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo
2021 arXiv   pre-print
For downstream tasks, our Pale Transformer backbone performs better than the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection & instance  ...  size of 22M, 48M, and 85M respectively for 224 ImageNet-1K classification, outperforming the previous Vision Transformer backbones.  ...  , object detection, and instance segmentation, respectively.  ...  We compare the performance of our Pale Transformer back-  ... 
arXiv:2112.14000v1 fatcat:mhn3mkrdwner7eswgpklwvip6u