Scaling Vision Transformers
[article]
2021
arXiv
pre-print
While the scaling laws for Transformer language models have been studied, it is unknown how Vision Transformers scale. ...
Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. ...
We are the first to scale Vision Transformers to even larger sizes, and we reach new state-of-the-art results in doing so.
5 Discussion
We demonstrate that the performance-compute frontier for Vision Transformer ...
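As a concrete illustration of the kind of law meant here, the performance-compute frontier is typically fit with a saturating power law; this is a generic form, not necessarily this paper's exact parameterization or constants:

```latex
E(C) = a\,(C + d)^{-b} + c, \qquad a, b, c, d > 0
```

Here E is downstream error, C training compute, b the scaling exponent, c the irreducible error floor, and d a shift accounting for the small-compute regime; the fitted values are the paper's to report.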
arXiv:2106.04560v1
fatcat:zl5lx3pq5jeqndnawkqjoo6bbq
Scaled ReLU Matters for Training Vision Transformers
[article]
2022
arXiv
pre-print
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). ...
The reasons for training difficulty are empirically analysed in , and the authors conjecture that the issue lies with the patchify-stem of ViT models and propose that early convolutions help transformers ...
In this paper, we investigate this basic block for training vision transformers as a lightweight stem. ... Vision Transformers (ViTs). Since Dosovitskiy et al. ...
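A minimal sketch (PyTorch assumed; widths illustrative) of the two stem designs this entry contrasts: the standard ViT patchify stem versus a lightweight convolutional stem in the spirit of "early convolutions". Plain BatchNorm+ReLU is used here; the paper's scaled-ReLU variant is the contribution under study, not what this sketch shows.

```python
import torch
import torch.nn as nn

embed_dim = 384

# Standard ViT "patchify" stem: one big-stride convolution, one token per 16x16 patch.
patchify_stem = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

# A convolutional stem: a stack of stride-2 3x3 convolutions that reaches the
# same 16x downsampling gradually (widths are illustrative assumptions).
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.BatchNorm2d(48), nn.ReLU(),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.BatchNorm2d(96), nn.ReLU(),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.BatchNorm2d(192), nn.ReLU(),
    nn.Conv2d(192, embed_dim, 3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
# Both stems produce the same 14x14 grid of embed_dim-dimensional tokens.
print(patchify_stem(x).shape, conv_stem(x).shape)  # (1, 384, 14, 14) twice
```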
arXiv:2109.03810v2
fatcat:jr22t7xle5c3hhrtd34pu227zu
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
[article]
2021
arXiv
pre-print
Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. ...
The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. ...
Figure 2 illustrates the network architecture of our proposed Cross-Attention Multi-Scale Vision Transformer (CrossViT). ...
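A hedged sketch of the cross-attention idea between two branches: the CLS token of one branch queries the patch tokens of the other. Names and dimensions are illustrative, and the branches are assumed to share one width for brevity (the paper projects between branch widths; see its Figure 2 for the actual design).

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

cls_large = torch.randn(1, 1, dim)       # CLS token of the large-patch branch
tokens_small = torch.randn(1, 197, dim)  # CLS + patch tokens of the small-patch branch

# Query: the other branch's CLS token; keys/values: this branch's tokens.
fused_cls, _ = attn(query=cls_large, key=tokens_small, value=tokens_small)
print(fused_cls.shape)  # (1, 1, 256): a CLS token enriched with cross-scale information
```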
arXiv:2103.14899v2
fatcat:ui7ufd7dnnbavnkuvi4raitbua
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
[article]
2021
arXiv
pre-print
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT for encoding high-resolution images using two techniques. ...
A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer ...
To obtain a multi-scale vision Transformer, we stack multiple (e.g., four) vision Transformers (ViT stages) sequentially. ...
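A minimal sketch of the stacking described above: several ViT stages run sequentially, each downsampling tokens 2x and widening the hidden size. All names and hyperparameters are illustrative assumptions, and the paper's Longformer-style attention inside each stage is omitted (plain global attention is used instead).

```python
import torch
import torch.nn as nn

class ViTStage(nn.Module):
    def __init__(self, in_dim, out_dim, depth, heads=4):
        super().__init__()
        # Patch merging: downsample 2x spatially and widen the channel dimension.
        self.merge = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)
        layer = nn.TransformerEncoderLayer(out_dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.merge(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.blocks(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Four stages stacked sequentially, as in the snippet's "e.g., four".
stages = nn.Sequential(
    ViTStage(3, 64, 2), ViTStage(64, 128, 2), ViTStage(128, 256, 2), ViTStage(256, 512, 2)
)
print(stages(torch.randn(1, 3, 64, 64)).shape)  # (1, 512, 4, 4)
```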
arXiv:2103.15358v2
fatcat:vxitfie6ovd5vanw3wlusi4the
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
[article]
2021
arXiv
pre-print
Transformers have made great progress in dealing with computer vision tasks. ...
The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings ...
Background: Vision Transformers. ...
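A hedged sketch of one way to give each layer cross-scale input embeddings, addressing point (1) above: convolutions with several kernel sizes share one stride and are concatenated along channels, so every embedding mixes features from multiple scales. The class name, kernel sizes, and widths are illustrative assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    def __init__(self, in_ch=3, dims=(32, 32, 32), kernels=(4, 8, 16), stride=4):
        super().__init__()
        # One projection per kernel size; padding keeps all outputs the same resolution.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_ch, d, k, stride=stride, padding=(k - stride) // 2)
            for d, k in zip(dims, kernels)
        )

    def forward(self, x):
        # Concatenate along channels: each token carries small- and large-scale features.
        return torch.cat([p(x) for p in self.projs], dim=1)

emb = CrossScaleEmbedding()
print(emb(torch.randn(1, 3, 64, 64)).shape)  # (1, 96, 16, 16)
```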
arXiv:2108.00154v2
fatcat:i5etc3f6zfelhjynatsmq66hjy
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
[article]
2021
arXiv
pre-print
Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. ...
Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. ...
Therefore, we propose HRViT, an efficient multi-scale high-resolution vision Transformer backbone specifically optimized for semantic segmentation. ...
arXiv:2111.01236v2
fatcat:pz7p32vrkravlfamypkarp6vl4
Down-Scaling for Better Transform Compression
[chapter]
2001
Scale-Space and Morphology in Computer Vision
Alfred M. Bruckstein, Michael Elad, and Ron Kimmel. The most popular lossy image compression method used on the Internet is the JPEG standard. ...
Assume we have a gray-scale image of size 512 × 512 with 8 bits/pixel as our original image. ...
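A rough sketch of the down-scale-then-compress experiment this abstract suggests, assuming Pillow; the file name and quality settings are placeholders, not the paper's protocol.

```python
import io
from PIL import Image

def jpeg_roundtrip(img, quality):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.tell(), Image.open(io.BytesIO(buf.getvalue()))  # (bytes used, decoded image)

original = Image.open("test.png").convert("L")  # e.g. a 512x512, 8 bits/pixel image

# Direct compression at low quality.
size_direct, direct = jpeg_roundtrip(original, quality=20)

# Down-scale, compress at higher quality, then up-scale back after decoding.
small = original.resize((original.width // 2, original.height // 2), Image.LANCZOS)
size_small, decoded = jpeg_roundtrip(small, quality=50)
restored = decoded.resize(original.size, Image.LANCZOS)

print(size_direct, size_small)  # compare byte budgets, then image quality vs. `original`
```

At aggressive bitrates, the down-scaled variant can spend its byte budget on higher per-pixel quality, which is the trade-off the title points at.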
doi:10.1007/3-540-47778-0_11
dblp:conf/scalespace/BrucksteinEK01
fatcat:lk37vkpibnfqvfu5jhpe7etezq
Multi-scale Arithmetization of Linear Transformations
2018
Journal of Mathematical Imaging and Vision
In this setting, the nonstandard version of the Euclidean affine transformation gives rise to a sequence of quasi-linear transformations over integer spaces, allowing integer-only computations. ...
A constructive nonstandard interpretation of a multiscale affine transformation scheme is presented. ...
It is just a first step toward a constructive, multi-scale model of such transformations. ...
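A loose sketch of the general idea of integer-only evaluation of an affine map, using ordinary fixed-point arithmetic; this is a generic construction for illustration, not the paper's nonstandard-analysis scheme.

```python
OMEGA = 1 << 16  # fixed-point denominator (an illustrative choice)

def quantize(a: float) -> int:
    # Replace a real coefficient by an integer numerator over OMEGA.
    return round(a * OMEGA)

def affine_int(x: int, a: float, b: float) -> int:
    # floor((A*x + B) / OMEGA) with A, B integers: every step is integer-only,
    # giving a quasi-linear integer approximation of x -> a*x + b.
    return (quantize(a) * x + quantize(b)) >> 16

print(affine_int(100, 1.5, 7.25))  # 157, i.e. floor(1.5*100 + 7.25)
```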
doi:10.1007/s10851-018-0853-6
fatcat:yutwumtkprbabjl36negnpj7oq
On the gray-scale inverse Hough transform
2000
Image and Vision Computing
This paper proposes a gray-scale inverse Hough transform (GIHT) algorithm which is combined with a modified gray-scale Hough transform (GHT). ...
Given only the data of the Hough transform (HT) space and the dimensions of the image, the GIHT algorithm reconstructs correctly the original gray-scale image. ...
The gray-scale Hough transform is similar to the CHT but differs in the voting procedure. ...
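A hedged sketch of one plausible reading of an intensity-weighted voting rule, assuming NumPy: each pixel votes along rho = x*cos(theta) + y*sin(theta) as in the classical transform, but the vote is weighted by the pixel's gray value rather than being 0/1. The actual GHT voting procedure is the paper's.

```python
import numpy as np

def gray_hough(img, n_theta=180):
    h, w = img.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * diag, n_theta))
    ys, xs = np.nonzero(img)  # zero-intensity pixels contribute nothing
    for y, x in zip(ys, xs):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += img[y, x]  # intensity-weighted vote
    return acc

acc = gray_hough(np.random.randint(0, 256, (32, 32)))
print(acc.shape)  # (2*diag, 180) accumulator
```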
doi:10.1016/s0262-8856(99)00067-0
fatcat:a4ia7mzqojhrvomz76c65q2adq
Multi-Scale Salience Distance Transforms
1993
Procedings of the British Machine Vision Conference 1993
The distance transform has been proposed for use in computer vision for a number of applications such as matching and skeletonisation. ...
This paper proposes two things: (1) a multi-scale distance transform to overcome the need to choose edge thresholds and scale and (2) the addition of various saliency factors such as edge strength, length ...
... The same multi-scale approach can be applied to the Salience Distance Transform (SDT) described in section 2 to form the Multi-Scale Salience Distance Transform (MSSDT). ...
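A rough sketch of the multi-scale motivation in point (1), assuming SciPy: distance transforms are taken over a range of edge thresholds and combined, so no single threshold has to be chosen up front. The mean as combination rule is an assumption for illustration, not the paper's rule.

```python
import numpy as np
from scipy import ndimage

def multiscale_dt(edge_strength, thresholds=(0.2, 0.4, 0.6, 0.8)):
    # For each threshold t, pixels with edge_strength >= t are "edges" (zeros),
    # and distance_transform_edt gives the distance to the nearest such edge.
    dts = [
        ndimage.distance_transform_edt(edge_strength < t)
        for t in thresholds
    ]
    return np.mean(dts, axis=0)  # combine across thresholds (illustrative rule)

edges = np.random.rand(64, 64)  # stand-in for an edge-strength map
print(multiscale_dt(edges).shape)  # (64, 64)
```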
doi:10.5244/c.7.58
dblp:conf/bmvc/RosinW93
fatcat:qvnfrnmrqbbbbfxxu6fzd2gex4
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
[article]
2021
arXiv
pre-print
We conduct extensive experiments to better understand the empirical relationships between Transformer-based architectures, dataset scale, and the performance of production vision systems. ...
Through a comprehensive study of offline and online evaluation, we show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications. ...
... adoption of the state-of-the-art Vision Transformer architecture. ...
arXiv:2108.05887v1
fatcat:gm5lzf4pkrg3zez7unuq7epp3a
Scaling the Scattering Transform: Deep Hybrid Networks
2017
2017 IEEE International Conference on Computer Vision (ICCV)
The specific representations derived from CNNs trained on large-scale image recognition are often used as representations in other computer vision tasks or datasets [40, 42]. ...
Consider a signal x(u), with u the spatial position index, and an integer J ∈ ℕ, which is the spatial scale of our scattering transform. ...
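In the usual scattering-transform notation (rotation indices omitted for brevity), the first- and second-order coefficients at spatial scale 2^J are wavelet-modulus terms averaged by a low-pass filter phi_J:

```latex
S^{(1)}_J x = |x \star \psi_{j_1}| \star \phi_J, \qquad
S^{(2)}_J x = \big|\,|x \star \psi_{j_1}| \star \psi_{j_2}\,\big| \star \phi_J,
\qquad 0 \le j_1 < j_2 < J
```

where psi_j is a wavelet at scale 2^j; the hybrid networks in this paper feed such coefficients into learned layers.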
doi:10.1109/iccv.2017.599
dblp:conf/iccv/OyallonBZ17
fatcat:q6i5qrojsjcdrcmgh2qfaa6zby
Age classification using Radon transform and entropy based scaling SVM
2011
Procedings of the British Machine Vision Conference 2011
Image features can be extracted using a difference of Gaussian filter followed by Radon transform. ...
To enhance the quality of feature selection, we introduce entropy estimation to the scaling classifier. ...
Rotation-invariant features using Radon transform Object recognition and classification require invariant features against various transformations such as rotation, scale, illumination and deformation. ...
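A minimal sketch of the stated feature pipeline, assuming SciPy and scikit-image; the Gaussian widths and angle count are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.transform import radon

def dog_radon_features(img, sigma_lo=1.0, sigma_hi=2.0, n_angles=180):
    # Difference-of-Gaussian band-pass filter, then a Radon transform of the result.
    dog = gaussian_filter(img, sigma_lo) - gaussian_filter(img, sigma_hi)
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    return radon(dog, theta=angles, circle=False)  # sinogram: one projection per angle

img = np.random.rand(64, 64)  # stand-in for a face image
print(dog_radon_features(img).shape)
```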
doi:10.5244/c.25.28
dblp:conf/bmvc/ZhouMZ11
fatcat:uqcr4pm5nfftpflg7rbcy7xztm
Light Field Scale-Depth Space Transform for Dense Depth Estimation
2014
2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops
... are captured at coarser scales and textured regions are found at finer scales. ...
We first propose a method for construction of light field scale-depth spaces, by convolving a given light field with a special kernel adapted to the light field structure. ...
One of the most well-known applications is the Scale-Invariant Feature Transform (SIFT), where feature detection is based on finding extrema in the scale-spaces built upon the Difference of Gaussian (DoG) ...
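A minimal sketch of the SIFT-style detection step mentioned here, assuming SciPy: build a difference-of-Gaussian stack and keep points that are extrema over both space and scale. The sigmas and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigmas=(1.0, 1.6, 2.6, 4.1), thresh=0.01):
    blurred = [gaussian_filter(img, s) for s in sigmas]
    # Difference of adjacent Gaussian blurs: shape (scales-1, H, W).
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    # Extrema over a 3x3x3 neighbourhood spanning space and scale.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    return np.argwhere((is_max | is_min) & (np.abs(dog) > thresh))

pts = dog_extrema(np.random.rand(64, 64))
print(pts[:5])  # rows of (scale_index, y, x)
```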
doi:10.1109/cvprw.2014.71
dblp:conf/cvpr/TosicB14
fatcat:7om7hieds5ayhg2766hfru6bvi
A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
[article]
2021
arXiv
pre-print
This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. ...
By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong ...
These design choices are also naturally introduced to change vision transformer architectures [21, 28, 42], with two main purposes: 1) support of multi-scale features, since dense vision tasks require ...
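A minimal sketch of the single-scale design described above (PyTorch assumed; sizes illustrative): one patchify embedding, then identical transformer blocks at a constant token count and hidden size, plain-ViT style.

```python
import torch
import torch.nn as nn

dim, depth = 256, 8
patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, depth)

x = patchify(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)  # (1, 196, 256)
y = encoder(x)  # feature resolution and hidden size never change across blocks
print(y.shape)  # (1, 196, 256)
```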
arXiv:2112.09747v1
fatcat:wyixjj5rzrh6zb2shr3v64karq