An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[article]
2021
arXiv
pre-print
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. ...
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results ...
Image patches are treated the same way as tokens (words) in an NLP application. ...
arXiv:2010.11929v2
fatcat:myedumsklfcidim27uii6plwq4
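The abstract above treats image patches the same way as word tokens in NLP. A minimal sketch of that patch-to-token step is below; the function name, shapes, and the random projection standing in for the learned linear embedding are illustrative assumptions, not the released ViT code.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, embed_dim=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into non-overlapping patch x patch patches and
    project each flattened patch to an embed_dim-dimensional token."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by the patch size"
    # Reshape into a grid of patches, then flatten each patch to a vector.
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # Illustrative random projection standing in for the learned linear embedding.
    W_embed = rng.standard_normal((patch * patch * C, embed_dim)) * 0.02
    return patches @ W_embed  # (num_patches, embed_dim) token sequence

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768) -- 14x14 patches of a 224x224 image
```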
An Image is Worth 16x16 Words, What is a Video Worth?
[article]
2021
arXiv
pre-print
Code is available at: https://github.com/Alibaba-MIIL/STAM ...
Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. ...
Method As shown by the recent work on Visual Transformers (ViT) [9] , self-attention models provide powerful representations for images, by viewing an image as a sequence of words, where each word embedding ...
arXiv:2103.13915v2
fatcat:dpwxoxo6wzdj7fenj4yzgxzlli
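The method snippet above views a video through frame-level transformer features. The toy sketch below applies a single self-attention head over hypothetical per-frame embeddings to mix information across time; it illustrates the general idea only and is not the STAM implementation.

```python
import numpy as np

def temporal_self_attention(frame_embeddings):
    """Single-head self-attention over per-frame embeddings of shape (T, D).
    A toy stand-in for aggregating frame-level ViT features over time."""
    T, D = frame_embeddings.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
    Q, K, V = frame_embeddings @ Wq, frame_embeddings @ Wk, frame_embeddings @ Wv
    scores = Q @ K.T / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V  # (T, D) temporally mixed frame representations

video_tokens = temporal_self_attention(np.zeros((16, 768)))  # 16 frames
print(video_tokens.shape)  # (16, 768)
```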
Encoding Retina Image to Words using Ensemble of Vision Transformers for Diabetic Retinopathy Grading
2021
F1000Research
To enhance DR grading, this paper proposes a novel solution based on an ensemble of state-of-the-art deep learning models called vision transformers. ...
Existing automatic solutions are mostly based on traditional image processing and machine learning techniques. Hence, there is a big gap when it comes to more generic detection and grading of DR. ...
An image is worth 16x16 words: Transformers for image recognition at scale. ICLR. 2021. 31. Ba JL, Kiros JR, Hinton GE: Layer Normalization. arXiv:1607.06450 [cs, stat]. Jul. 2016. ...
doi:10.12688/f1000research.73082.1
fatcat:d2mtsxrnkrfuflero42nrbfp7y
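The entry above grades diabetic retinopathy with an ensemble of vision transformers. A simple soft-voting sketch is below; the five-grade setup and the three probability vectors are hypothetical, and the paper's exact ensembling strategy may differ.

```python
import numpy as np

def ensemble_grade(prob_list):
    """Average per-model class probabilities and return the predicted DR grade.
    prob_list: list of arrays of shape (num_classes,), one per vision transformer."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return int(np.argmax(mean_probs)), mean_probs

# Hypothetical softmax outputs from three transformer variants over 5 DR grades.
grade, probs = ensemble_grade([
    np.array([0.10, 0.15, 0.60, 0.10, 0.05]),
    np.array([0.05, 0.20, 0.55, 0.15, 0.05]),
    np.array([0.10, 0.10, 0.65, 0.10, 0.05]),
])
print(grade)  # 2
```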
Learning invariant features through topographic filter maps
2009
2009 IEEE Conference on Computer Vision and Pattern Recognition
The learned feature descriptors give comparable results as SIFT on image recognition tasks for which SIFT is well suited, and better results than SIFT on tasks for which SIFT is less well suited. ...
The first stage is often composed of three main modules: (1) a bank of filters (often oriented edge detectors); (2) a non-linear transform, such as a point-wise squashing function, quantization, or normalization ...
Acknowledgments We thank Karol Gregor, Y-Lan Boureau, Eero Simoncelli, and members of the CIfAR program Neural Computation and Adaptive Perception for helpful discussions. ...
doi:10.1109/cvprw.2009.5206545
fatcat:zxun6vqtezb4pjudj6jgnyweby
doi:10.1109/cvpr.2009.5206545
dblp:conf/cvpr/KavukcuogluRFL09
fatcat:ze5eoanhknc4nhrbuwgneti5eu
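The first stage quoted above (a bank of oriented filters, a point-wise squashing non-linearity, a normalization) can be sketched generically as follows; the Sobel kernels and tanh squashing are stand-ins, and this is not the paper's topographic filter-map learning.

```python
import numpy as np
from scipy.signal import convolve2d

def first_stage(image, filters, eps=1e-6):
    """Generic first-stage pipeline: apply a bank of (oriented) filters,
    a point-wise squashing non-linearity, then a simple normalization
    across the filter responses at each pixel."""
    responses = np.stack([convolve2d(image, f, mode="same") for f in filters])
    squashed = np.tanh(responses)                      # point-wise squashing
    norm = np.sqrt((squashed ** 2).sum(axis=0)) + eps  # per-pixel response norm
    return squashed / norm                             # normalized feature maps

# Two toy oriented edge filters (horizontal and vertical Sobel kernels).
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
features = first_stage(np.random.default_rng(0).random((32, 32)), [sobel_x, sobel_x.T])
print(features.shape)  # (2, 32, 32)
```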
Pre-Training Transformers for Domain Adaptation
[article]
2021
arXiv
pre-print
The Visual Domain Adaptation Challenge 2021 called for unsupervised domain adaptation methods that could improve the performance of models by transferring the knowledge obtained from source datasets to ...
An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[24] Christoph Schuhmann. 400-million open dataset, Oct 2021.
[25] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and ...
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Geoffrey E Hinton, Zoubin Ghahramani, and Yee Whye Teh. ...
arXiv:2112.09965v1
fatcat:7uecoevbqnck7ggzwtyygm6ntm
Rotation, scaling and translation invariant object recognition in satellite images
2016
Communications Faculty Of Science University of Ankara
In this paper, rotation, scaling and translation invariant object recognition in satellite imagery is performed. ...
Algorithms used to recognize objects in satellite images should find them without being affected from these variations. ...
Hessian matrix H(x, y, σ) at a point (x, y) in an image I at scale σ is defined as $H(x,y,\sigma) = \begin{bmatrix} L_{xx}(x,y,\sigma) & L_{xy}(x,y,\sigma) \\ L_{xy}(x,y,\sigma) & L_{yy}(x,y,\sigma) \end{bmatrix}$ (3.7), where Lxx(x,y,σ) is the convolution of the image ...
doi:10.1501/commua1-2_0000000095
fatcat:saugosnqnndv3lmrdbuargumsm
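For reference, the scale-space Hessian of Eq. (3.7) can be computed with Gaussian derivative filters, as in the sketch below; the axis-to-derivative naming is a convention choice and the code is illustrative rather than the paper's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_determinant(image, sigma):
    """Second Gaussian derivatives of the image at scale sigma, and the
    Hessian determinant commonly used for blob/keypoint detection."""
    Lxx = gaussian_filter(image, sigma, order=(2, 0))
    Lyy = gaussian_filter(image, sigma, order=(0, 2))
    Lxy = gaussian_filter(image, sigma, order=(1, 1))
    return Lxx * Lyy - Lxy ** 2  # det H(x, y, sigma) at every pixel

det = hessian_determinant(np.random.default_rng(0).random((64, 64)), sigma=2.0)
print(det.shape)  # (64, 64)
```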
A Two-Layer Local Constrained Sparse Coding Method for Fine-Grained Visual Categorization
[article]
2015
arXiv
pre-print
The two-layer architecture is introduced for learning intermediate-level features, and the local constrained term is applied to guarantee the local smooth of coding coefficients. ...
For extracting more discriminative information, local orientation histograms are the input of sparse coding instead of raw pixels. ...
It's worth noting that dictionaries in both pipelines and scales are trained separately to best fit for features. ...
arXiv:1505.02505v1
fatcat:33y5qp22ajd3dlaexgao2vxulq
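The abstract above adds a local constraint on the coding coefficients. As a point of reference only, a generic locality-constrained sparse coding objective (in the spirit of LLC, not necessarily the paper's exact two-layer formulation) reads:

```latex
\min_{B,\;\{c_i\}} \;\sum_{i=1}^{N}
  \left\| x_i - B\,c_i \right\|_2^2
  \;+\; \lambda \left\| d_i \odot c_i \right\|_2^2
\qquad \text{s.t.}\quad \mathbf{1}^{\top} c_i = 1 \;\;\forall i
```

Here the $x_i$ are local descriptors (e.g., orientation histograms), $B$ the dictionary, $c_i$ the codes, and $d_i$ weights basis vectors by their distance to $x_i$ so that nearby atoms are preferred.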
Development of an object recognition algorithm based on neural networks With using a hierarchical classifier
2021
Procedia Computer Science
This paper proposes the architecture of a convolutional neural network that creates a neural network system for recognizing objects in images using our own approach to classification using a hierarchical ...
The architecture will be assigned to find the optimal solution to the problem for many sets of image data and, unlike existing approaches, will have high performance indicators without losing the number ...
In other words, they can perform recognition only if there is minimal noise and there is no transformation of the analyzed object located on the "white" scene. ...
doi:10.1016/j.procs.2021.03.055
fatcat:6pepwj46v5bj5ierpdq3nhvuiu
Detection and Recognition of Diseases from Paddy Plant Leaf Images
2016
International Journal of Computer Applications
It is an image recognition system for identifying the paddy plant diseases that first involves disease detection and then disease recognition. ...
In this work Scale Invariant Feature Transform (SIFT) is used to get features from the disease affected images. ...
doi:10.5120/ijca2016910505
fatcat:ejla57tt65fcddh5iwja7aic5m
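A minimal sketch of the SIFT feature-extraction step named above, assuming OpenCV 4.4+ where SIFT ships as cv2.SIFT_create; the image path is hypothetical.

```python
import cv2

# Load a leaf image in grayscale (path is a placeholder) and extract SIFT features.
image = cv2.imread("paddy_leaf.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)  # N keypoints, (N, 128) descriptors
```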
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition
[article]
2021
arXiv
pre-print
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. ...
This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently ...
Acknowledgements This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grants 2018AAA0100701, the National Natural ...
arXiv:2105.15075v2
fatcat:fouxp7s4z5a7hlherv43ueszhu
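The adaptive early-exit idea described above can be illustrated with a toy cascade: run progressively larger models and stop once the prediction is confident enough. The models, token-grid sizes, and threshold below are hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

def cascaded_inference(image, models, threshold=0.9):
    """Run progressively larger token-budget models and stop as soon as one is
    confident enough. `models` maps an image to class probabilities."""
    for model in models:
        probs = model(image)
        if probs.max() >= threshold:      # confident enough: terminate early
            return int(np.argmax(probs)), probs
    return int(np.argmax(probs)), probs   # fall back to the last (largest) model

# Hypothetical stand-ins for transformers with 7x7, 10x10 and 14x14 token grids.
models = [
    lambda img: np.array([0.55, 0.45]),   # small model, unsure
    lambda img: np.array([0.95, 0.05]),   # medium model, confident -> exit here
    lambda img: np.array([0.99, 0.01]),   # never reached for this input
]
print(cascaded_inference(None, models))   # (0, array([0.95, 0.05]))
```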
Make A Long Image Short: Adaptive Token Length for Vision Transformers
[article]
2021
arXiv
pre-print
Motivated by the proverb "A picture is worth a thousand words" we aim to accelerate the ViT model by making a long image short. ...
The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. ...
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint ... [27] Zhenhua Liu, Yunhe Wang, Kai Han, Siwei Ma, and Wen Gao. ...
arXiv:2112.01686v2
fatcat:fzenydaarjg3jffxjwv324qsa4
Transformer based trajectory prediction
[article]
2021
arXiv
pre-print
Motion prediction is an extremely challenging task which recently gained significant attention of the research community. ...
In this work, we present a simple and yet strong baseline for uncertainty aware motion prediction based purely on transformer neural networks, which has shown its effectiveness in conditions of domain ...
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
[6] Liangji Fang et al. ...
arXiv:2112.04350v1
fatcat:lz5swmmhgnaajbcl6c6s3te32i
Implicit Transformer Network for Screen Content Image Continuous Super-Resolution
[article]
2021
arXiv
pre-print
For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer and an implicit position encoding ...
However, image SR methods mostly designed for natural images do not generalize well for SCIs due to the very different image characteristics as well as the requirement of SCI browsing at arbitrary scales ...
An image is worth 16x16 words: Transformers for image recognition at scale. ...
arXiv:2112.06174v1
fatcat:u5tywi75avh33gddbn2zsaw5cy
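The abstract above infers pixel values at query coordinates from image features at key coordinates. The sketch below is a generic single-head cross-attention between query-side and key-side features; it omits the implicit position encoding and is not the paper's implicit transformer.

```python
import numpy as np

def coordinate_cross_attention(query_feats, key_feats, values):
    """Each query (e.g. a feature attached to a query coordinate) attends over
    key features (at key coordinates) to predict a value such as an RGB pixel."""
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ values  # (num_queries, value_dim)

rgb = coordinate_cross_attention(np.zeros((4, 64)),   # 4 query-coordinate features
                                 np.zeros((9, 64)),   # 9 key-coordinate features
                                 np.zeros((9, 3)))    # RGB values at key coordinates
print(rgb.shape)  # (4, 3)
```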
Audio-Visual Speech Recognition is Worth 32×32×8 Voxels
[article]
2021
arXiv
pre-print
Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks. ...
On an AV-ASR task, the transformer front-end performs as well as (or better than) the convolutional baseline. Fine-tuning our model on the LRS3-TED training set matches previous state of the art. ...
In a nutshell, the ViT model extracts non-overlapping 16x16 image patches, embeds them with a linear transform, and runs a transformer. ...
arXiv:2109.09536v1
fatcat:hhy76zxdyrdkpmdgasmreh5u7a
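The title above patches a video into 32x32x8 voxel blocks before the transformer. A minimal sketch of that voxelization is below; the grayscale mouth-crop assumption and the shapes are illustrative, not the paper's front-end.

```python
import numpy as np

def video_to_voxel_tokens(video, ph=32, pw=32, pt=8):
    """Split a (T, H, W) grayscale video into non-overlapping pt x ph x pw voxel
    blocks and flatten each block into one token."""
    T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # group the block-index axes first
    return blocks.reshape(-1, pt * ph * pw)      # (num_tokens, 8*32*32)

tokens = video_to_voxel_tokens(np.zeros((32, 64, 64)))
print(tokens.shape)  # (16, 8192): 4 temporal x 2 x 2 spatial blocks
```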
Showing results 1 — 15 out of 239 results