Scaling Vision Transformers [article]

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer
2021 arXiv   pre-print
While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale.  ...  Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks.  ...  We are the first to scale Vision Transformers to even larger size and reach new state-of-the-art results doing so. 5 Discussion We demonstrate that the performance-compute frontier for Vision Transformer  ... 
arXiv:2106.04560v1 fatcat:zl5lx3pq5jeqndnawkqjoo6bbq

Scaled ReLU Matters for Training Vision Transformers [article]

Pichao Wang and Xue Wang and Hao Luo and Jingkai Zhou and Zhipeng Zhou and Fan Wang and Hao Li and Rong Jin
2022 arXiv   pre-print
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs).  ...  The reasons for training difficulty are empirically analysed in , and the authors conjecture that the issue lies with the patchify-stem of ViT models and propose that early convolutions help transformers  ...  In this paper, we investigate this basic block for training vision transformers as a lightweight stem. Vision Transformers (ViTs). Since Dosovitskiy et al.  ... 
arXiv:2109.03810v2 fatcat:jr22t7xle5c3hhrtd34pu227zu

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [article]

Chun-Fu Chen, Quanfu Fan, Rameswar Panda
2021 arXiv   pre-print
Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification.  ...  The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks.  ...  Figure 2 illustrates the network architecture of our proposed Cross-Attention Multi-Scale Vision Transformer (CrossViT).  ... 
arXiv:2103.14899v2 fatcat:ui7ufd7dnnbavnkuvi4raitbua

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [article]

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao
2021 arXiv   pre-print
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer, which significantly enhances the ViT of for encoding high-resolution images using two techniques.  ...  A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer  ...  To obtain a multi-scale vision Transformer, we stack multiple (e.g., four) vision Transformers (ViT stages) sequentially.  ... 
arXiv:2103.15358v2 fatcat:vxitfie6ovd5vanw3wlusi4the

CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention [article]

Wenxiao Wang, Lu Yao, Long Chen, Binbin Lin, Deng Cai, Xiaofei He, Wei Liu
2021 arXiv   pre-print
Transformers have made great progress in dealing with computer vision tasks.  ...  The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings  ...  BACKGROUND Vision Transformers.  ... 
arXiv:2108.00154v2 fatcat:i5etc3f6zfelhjynatsmq66hjy

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation [article]

Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, David Z. Pan
2021 arXiv   pre-print
Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models.  ...  Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs.  ...  Therefore, we propose HRViT, an efficient multi-scale high-resolution vision Transformer backbone specifically optimized for semantic segmentation.  ... 
arXiv:2111.01236v2 fatcat:pz7p32vrkravlfamypkarp6vl4

Down-Scaling for Better Transform Compression [chapter]

Alfred M. Bruckstein, Michael Elad, Ron Kimmel
2001 Scale-Space and Morphology in Computer Vision  
The most popular lossy image compression method used on the Internet is the JPEG standard.  ...  Assume we have a gray-scale image of size 512 × 512 with 8 bits/pixel as our original image.  ... 
doi:10.1007/3-540-47778-0_11 dblp:conf/scalespace/BrucksteinEK01 fatcat:lk37vkpibnfqvfu5jhpe7etezq

Multi-scale Arithmetization of Linear Transformations

Loïc Mazo
2018 Journal of Mathematical Imaging and Vision  
In this setting, the nonstandard version of the Euclidean affine transformation gives rise to a sequence of quasi-linear transformations over integer spaces, allowing integer-only computations.  ...  A constructive nonstandard interpretation of a multiscale affine transformation scheme is presented.  ...  It is just a first step toward a constructive, multi-scale model of such transformations.  ... 
doi:10.1007/s10851-018-0853-6 fatcat:yutwumtkprbabjl36negnpj7oq

On the gray-scale inverse Hough transform

A.L Kesidis, N Papamarkos
2000 Image and Vision Computing  
This paper proposes a gray-scale inverse Hough transform (GIHT) algorithm which is combined with a modified gray-scale Hough transform (GHT).  ...  Given only the data of the Hough transform (HT) space and the dimensions of the image, the GIHT algorithm reconstructs correctly the original gray-scale image.  ...  The gray-scale Hough transform The gray-scale Hough transform is similar to the CHT but differs in the voting procedure.  ... 
doi:10.1016/s0262-8856(99)00067-0 fatcat:a4ia7mzqojhrvomz76c65q2adq

Multi-Scale Salience Distance Transforms

Paul L. Rosin, Geoff A. W. West
1993 Proceedings of the British Machine Vision Conference 1993  
The distance transform has been proposed for use in computer vision for a number of applications such as matching and skeletonisation.  ...  This paper proposes two things: (1) a multi-scale distance transform to overcome the need to choose edge thresholds and scale and (2) the addition of various saliency factors such as edge strength, length  ...  and (3) The same multi-scale approach can be applied to the Salience Distance Transform (SDT) described in section 2 to form the Multi Scale Salience Distance Transform (MSSDT).  ... 
doi:10.5244/c.7.58 dblp:conf/bmvc/RosinW93 fatcat:qvnfrnmrqbbbbfxxu6fzd2gex4

Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations [article]

Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
2021 arXiv   pre-print
We conduct extensive experiments to better understand the empirical relationships between Transformer-based architectures, dataset scale, and the performance of production vision systems.  ...  Through a comprehensive study of offline and online evaluation, we show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications.  ...  adoption of the state-of-the-art Vision Transformer architecture.  ... 
arXiv:2108.05887v1 fatcat:gm5lzf4pkrg3zez7unuq7epp3a

Scaling the Scattering Transform: Deep Hybrid Networks

Edouard Oyallon, Eugene Belilovsky, Sergey Zagoruyko
2017 2017 IEEE International Conference on Computer Vision (ICCV)  
The specific representations derived from CNNs trained on large scale image recognition are often used as representations in other computer vision tasks or datasets [40, 42].  ...  Consider a signal x(u), with u the spatial position index and an integer J ∈ N, which is the spatial scale of our scattering transform.  ... 
doi:10.1109/iccv.2017.599 dblp:conf/iccv/OyallonBZ17 fatcat:q6i5qrojsjcdrcmgh2qfaa6zby

Age classification using Radon transform and entropy based scaling SVM

Huiyu Zhou, Paul Miller, Jianguo Zhang
2011 Proceedings of the British Machine Vision Conference 2011  
Image features can be extracted using a difference of Gaussian filter followed by Radon transform.  ...  To enhance the quality of feature selection, we introduce entropy estimation to the scaling classifier.  ...  Rotation-invariant features using Radon transform Object recognition and classification require invariant features against various transformations such as rotation, scale, illumination and deformation.  ... 
doi:10.5244/c.25.28 dblp:conf/bmvc/ZhouMZ11 fatcat:uqcr4pm5nfftpflg7rbcy7xztm

Light Field Scale-Depth Space Transform for Dense Depth Estimation

Ivana Tosic, Kathrin Berkner
2014 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops  
are captured at coarser scales and textured regions are found at finer scales.  ...  We first propose a method for construction of light field scale-depth spaces, by convolving a given light field with a special kernel adapted to the light field structure.  ...  One of most well known applications is the Scale-Invariant-Feature-Transform (SIFT), where feature detection is based on finding extrema in the scale-spaces built upon the Difference of Gaussian (DoG)  ... 
doi:10.1109/cvprw.2014.71 dblp:conf/cvpr/TosicB14 fatcat:7om7hieds5ayhg2766hfru6bvi

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [article]

Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
2021 arXiv   pre-print
This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks.  ...  By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong  ...  These design choices are also naturally introduced to change vision transformer architectures [21, 28, 42], with two main purposes: 1) support of multi-scale features, since dense vision tasks require  ... 
arXiv:2112.09747v1 fatcat:wyixjj5rzrh6zb2shr3v64karq