239 Hits in 6.3 sec

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [article]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
2021 arXiv   pre-print
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.  ...  When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results  ...  Image patches are treated the same way as tokens (words) in an NLP application.  ... 
arXiv:2010.11929v2 fatcat:myedumsklfcidim27uii6plwq4
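
The patch-to-token step this abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a 224x224 RGB input; the random projection stands in for ViT's learned linear embedding, and all sizes and names are illustrative.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, dim=768):
    """Split an HxWxC image into non-overlapping patch x patch tiles and
    project each flattened tile to a dim-dimensional token. The random
    projection stands in for ViT's learned linear embedding."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, C) -> (num_patches, p*p*C)
    tiles = (image.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * c))
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch * c, dim)) / np.sqrt(patch * patch * c)
    return tiles @ proj  # one "word" per 16x16 patch

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): the 196 "16x16 words" of a 224x224 image
```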

An Image is Worth 16x16 Words, What is a Video Worth? [article]

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
2021 arXiv   pre-print
Code is available at: https://github.com/Alibaba-MIIL/STAM  ...  Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video.  ...  Method As shown by the recent work on Visual Transformers (ViT) [9] , self-attention models provide powerful representations for images, by viewing an image as a sequence of words, where each word embedding  ... 
arXiv:2103.13915v2 fatcat:dpwxoxo6wzdj7fenj4yzgxzlli
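
The snippet's "image as a sequence of words" view extends to video by attending over per-frame embeddings. Below is a hedged sketch, assuming each frame has already been reduced to a single ViT-style embedding; STAM's actual temporal aggregation is more elaborate than this single attention step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(frame_embs):
    """Given per-frame embeddings (T, D) -- e.g. each frame's ViT [CLS]
    output -- pool them with one self-attention step, so the clip summary
    is driven by the frames that attend to each other most strongly."""
    scores = frame_embs @ frame_embs.T / np.sqrt(frame_embs.shape[1])
    return (softmax(scores, axis=-1) @ frame_embs).mean(axis=0)  # (D,)

clip = temporal_attention_pool(np.random.default_rng(0).standard_normal((8, 512)))
print(clip.shape)  # (512,): one embedding for the whole 8-frame clip
```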

Encoding Retina Image to Words using Ensemble of Vision Transformers for Diabetic Retinopathy Grading

Nouar AlDahoul, Hezerul Abdul Karim, Myles Joshua Toledo Tan, Mhd Adel Momo, Jamie Ledesma Fermin
2021 F1000Research  
To enhance DR grading, this paper proposes a novel solution based on an ensemble of state-of-the-art deep learning models called vision transformers.  ...  Existing automatic solutions are mostly based on traditional image processing and machine learning techniques. Hence, there is a big gap when it comes to more generic detection and grading of DR.  ...  An image is worth 16x16 words: Transformers for image recognition at scale. ICLR. 2021. 31. Ba JL, Kiros JR, Hinton GE: Layer Normalization. arXiv:1607.06450 [cs, stat]. Jul. 2016.  ... 
doi:10.12688/f1000research.73082.1 fatcat:d2mtsxrnkrfuflero42nrbfp7y
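
The snippet does not spell out the fusion rule, so the following is only a plausible sketch of the ensembling step: averaging per-model softmax outputs over the five DR grades (0-4) and taking the argmax. The probabilities shown are invented for illustration.

```python
import numpy as np

def ensemble_grade(prob_sets):
    """Average the per-model softmax outputs over the DR grades and take
    the argmax. Mean-of-probabilities is one common fusion choice; the
    paper's exact rule is not given in the snippet."""
    return int(np.mean(prob_sets, axis=0).argmax())

# three hypothetical vision-transformer outputs for one retina image
probs = [np.array([0.1, 0.6, 0.2, 0.05, 0.05]),
         np.array([0.2, 0.5, 0.2, 0.05, 0.05]),
         np.array([0.1, 0.4, 0.4, 0.05, 0.05])]
print(ensemble_grade(probs))  # 1 -- grade "mild DR" under these toy outputs
```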

Learning invariant features through topographic filter maps

Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, Yann LeCun
2009 2009 IEEE Conference on Computer Vision and Pattern Recognition  
The learned feature descriptors give results comparable to SIFT on image recognition tasks for which SIFT is well suited, and better results than SIFT on tasks for which SIFT is less well suited.  ...  The first stage is often composed of three main modules: (1) a bank of filters (often oriented edge detectors); (2) a non-linear transform, such as a point-wise squashing function, quantization, or normalization  ...  Acknowledgments We thank Karol Gregor, Y-Lan Boureau, Eero Simoncelli, and members of the CIfAR program Neural Computation and Adaptive Perception for helpful discussions.  ... 
doi:10.1109/cvpr.2009.5206545 dblp:conf/cvpr/KavukcuogluRFL09 fatcat:ze5eoanhknc4nhrbuwgneti5eu
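
The three-module first stage this abstract enumerates can be sketched directly. The sketch below uses hand-written derivative kernels rather than the paper's learned topographic filter bank, tanh as an example of the point-wise squashing, and simple average pooling; all three choices are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

def first_stage(image):
    """Generic first stage: (1) a small bank of oriented edge filters,
    (2) a point-wise squashing non-linearity, (3) local average pooling.
    These filters are simple derivatives, not the learned topographic ones."""
    bank = [np.array([[-1, 0, 1]] * 3, float),        # vertical edges
            np.array([[-1, 0, 1]] * 3, float).T,      # horizontal edges
            np.array([[0, 1, 0], [-1, 0, 1], [0, -1, 0]], float)]  # diagonal
    maps = [np.tanh(convolve(image, k)) for k in bank]  # filter + squash
    pool = [m.reshape(m.shape[0] // 4, 4, m.shape[1] // 4, 4).mean(axis=(1, 3))
            for m in maps]                              # 4x4 average pooling
    return np.stack(pool)  # (n_filters, H/4, W/4) feature maps

print(first_stage(np.random.default_rng(0).standard_normal((64, 64))).shape)
# (3, 16, 16)
```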

Pre-Training Transformers for Domain Adaptation [article]

Burhan Ul Tayyab, Nicholas Chua
2021 arXiv   pre-print
The Visual Domain Adaptation Challenge 2021 called for unsupervised domain adaptation methods that could improve the performance of models by transferring the knowledge obtained from source datasets to  ...  An image is worth 16x16 words: Transformers for image recognition at scale, 2021. [24] Christoph Schuhmann. 400-million open dataset, Oct 2021. [25] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and  ...  An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. [10] Geoffrey E Hinton, Zoubin Ghahramani, and Yee Whye Teh.  ... 
arXiv:2112.09965v1 fatcat:7uecoevbqnck7ggzwtyygm6ntm

Rotation, scaling and translation invariant object recognition in satellite images

Yusuf SOYMAN; ILGIN
2016 Communications Faculty Of Science University of Ankara  
In this paper, rotation, scaling and translation invariant object recognition in satellite imagery is performed.  ...  Algorithms used to recognize objects in satellite images should find them without being affected by these variations.  ...  The Hessian matrix $H(x, y, \sigma)$ at a point $(x, y)$ in an image $I$ at scale $\sigma$ is defined as follows: $H(x, y, \sigma) = \begin{bmatrix} L_{xx}(x, y, \sigma) & L_{xy}(x, y, \sigma) \\ L_{xy}(x, y, \sigma) & L_{yy}(x, y, \sigma) \end{bmatrix}$ (3.7), where $L_{xx}(x, y, \sigma)$ is the convolution of the image  ... 
doi:10.1501/commua1-2_0000000095 fatcat:saugosnqnndv3lmrdbuargumsm
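
Eq. (3.7) can be evaluated with Gaussian-derivative convolutions. A minimal sketch, assuming SciPy and a grayscale image; the determinant-of-Hessian response is the standard blob-detector use of this matrix (as in SURF) and may differ in detail from the paper's pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(image, sigma=2.0):
    """L_xx, L_xy, L_yy are second-order Gaussian-derivative convolutions of
    the image at scale sigma; det(H) highlights blob-like interest points."""
    Lxx = gaussian_filter(image, sigma, order=(0, 2))  # d^2/dx^2
    Lyy = gaussian_filter(image, sigma, order=(2, 0))  # d^2/dy^2
    Lxy = gaussian_filter(image, sigma, order=(1, 1))  # d^2/dxdy
    return Lxx * Lyy - Lxy ** 2                        # det of the 2x2 Hessian

resp = hessian_response(np.random.default_rng(0).standard_normal((64, 64)))
print(resp.shape)  # (64, 64); local maxima of resp are candidate keypoints
```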

A Two-Layer Local Constrained Sparse Coding Method for Fine-Grained Visual Categorization [article]

Guo Lihua, Guo Chenggan
2015 arXiv   pre-print
The two-layer architecture is introduced for learning intermediate-level features, and the local constrained term is applied to guarantee the local smoothness of the coding coefficients.  ...  For extracting more discriminative information, local orientation histograms are the input of sparse coding instead of raw pixels.  ...  It is worth noting that dictionaries in both pipelines and scales are trained separately to best fit the features.  ... 
arXiv:1505.02505v1 fatcat:33y5qp22ajd3dlaexgao2vxulq
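
The "local constrained term" is not given explicitly in the snippet. The sketch below shows one standard locality-constrained coding step (LLC-style): a feature is reconstructed from its k nearest dictionary atoms only, which keeps the code both sparse and locally smooth. It is an illustration of the general technique, not the paper's exact formulation.

```python
import numpy as np

def local_constrained_code(x, D, k=5):
    """Code feature x (dim,) over dictionary D (n_atoms, dim) using only its
    k nearest atoms, via the LLC closed-form least-squares solution."""
    d2 = ((D - x) ** 2).sum(axis=1)            # squared distance to each atom
    idx = np.argsort(d2)[:k]                   # indices of k nearest atoms
    B = D[idx]                                 # (k, dim) local basis
    # solve min_c ||x - B^T c||^2  s.t. sum(c) = 1
    C = (B - x) @ (B - x).T + 1e-6 * np.eye(k)
    c = np.linalg.solve(C, np.ones(k))
    c /= c.sum()
    code = np.zeros(len(D))                    # sparse code over the dictionary
    code[idx] = c
    return code

rng = np.random.default_rng(0)
code = local_constrained_code(rng.standard_normal(32), rng.standard_normal((256, 32)))
print(np.count_nonzero(code))  # 5: only the local atoms participate
```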

Development of an object recognition algorithm based on neural networks With using a hierarchical classifier

V.T. Nguyen, F.F. Pashchenko
2021 Procedia Computer Science  
This paper proposes a convolutional neural network architecture that builds a neural network system for recognizing objects in images using our own approach to classification with a hierarchical  ...  The architecture is designed to find the optimal solution to the problem for many sets of image data and, unlike existing approaches, to achieve high performance without losing the number  ...  In other words, they can perform recognition only if there is minimal noise and there is no transformation of the analyzed object located on the "white" scene.  ... 
doi:10.1016/j.procs.2021.03.055 fatcat:6pepwj46v5bj5ierpdq3nhvuiu
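
A hierarchical classifier of the kind this abstract describes can be sketched as coarse-to-fine routing: a coarse head picks a superclass, then a per-group fine head picks the final label. The hierarchy and the stand-in model callables below are purely illustrative.

```python
import numpy as np

# illustrative two-level class hierarchy, not the paper's
HIERARCHY = {"vehicle": ["car", "truck", "bus"],
             "animal": ["cat", "dog", "horse"]}

def classify(features, coarse_head, fine_heads):
    """coarse_head / fine_heads stand in for trained network heads
    (callables mapping a feature vector to class probabilities)."""
    group = list(HIERARCHY)[int(np.argmax(coarse_head(features)))]
    label = HIERARCHY[group][int(np.argmax(fine_heads[group](features)))]
    return group, label

rng = np.random.default_rng(0)
def random_head(n):                      # stand-in for a trained head
    return lambda feats: rng.dirichlet(np.ones(n))

print(classify(np.zeros(128), random_head(2),
               {g: random_head(3) for g in HIERARCHY}))
```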

Detection and Recognition of Diseases from Paddy Plant Leaf Images

K. Jagan, M. Balasubramanian, S. Palanivel
2016 International Journal of Computer Applications  
It is an image recognition system for identifying paddy plant diseases that first performs disease detection and then disease recognition.  ...  In this work, the Scale Invariant Feature Transform (SIFT) is used to extract features from the disease-affected images.  ... 
doi:10.5120/ijca2016910505 fatcat:ejla57tt65fcddh5iwja7aic5m
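
The SIFT feature-extraction step mentioned here maps directly onto OpenCV (version 4.4 or later, where SIFT lives in the main module). A minimal sketch; the paper's full pipeline (lesion segmentation, classifier choice) is not reproduced, and the image path is hypothetical.

```python
import cv2

def sift_features(image_path):
    """Detect SIFT keypoints and compute their 128-d descriptors
    on a grayscale leaf image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (n_keypoints, 128)

# kps, desc = sift_features("leaf.jpg")  # hypothetical input path
```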

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [article]

Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
2021 arXiv   pre-print
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token.  ...  This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently  ...  Acknowledgements This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grants 2018AAA0100701, the National Natural  ... 
arXiv:2105.15075v2 fatcat:fouxp7s4z5a7hlherv43ueszhu
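
The cascade-with-early-exit idea in this abstract reduces to a short loop: try transformers with increasing token counts and stop as soon as one is confident enough. A hedged sketch with stand-in models and an illustrative confidence threshold; the paper's token counts and exit criterion may differ.

```python
import numpy as np

def cascade_predict(image, models, threshold=0.9):
    """`models` is a list of (token_count, model) pairs ordered from cheap
    to expensive; each model is a stand-in callable returning class
    probabilities. Inference stops at the first confident model."""
    for n_tokens, model in models:
        probs = model(image, n_tokens)
        if probs.max() >= threshold:       # sufficiently confident: exit early
            return int(probs.argmax()), n_tokens
    return int(probs.argmax()), n_tokens   # fall back to the largest model

fake = lambda conf: (lambda img, n: np.array([conf, 1 - conf]))
models = [(7 * 7, fake(0.6)), (10 * 10, fake(0.95)), (14 * 14, fake(0.99))]
print(cascade_predict(None, models))  # (0, 100): the 10x10-token stage sufficed
```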

Make A Long Image Short: Adaptive Token Length for Vision Transformers [article]

Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang
2021 arXiv   pre-print
Motivated by the proverb "A picture is worth a thousand words", we aim to accelerate the ViT model by making a long image short.  ...  The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing.  ...  An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint. [27] Zhenhua Liu, Yunhe Wang, Kai Han, Siwei Ma, and Wen Gao.  ... 
arXiv:2112.01686v2 fatcat:fzenydaarjg3jffxjwv324qsa4
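
Making "a long image short" amounts to choosing a coarser token grid for easy inputs. The sketch below fakes the paper's learned decision module with an `easy_score` parameter and uses nearest-neighbour downsampling; both are illustrative assumptions.

```python
import numpy as np

def tokens_for_image(image, grids=(4, 7, 14), patch=16, easy_score=0.5):
    """Pick a coarser patch grid (fewer tokens, via downsampling) for 'easy'
    images and the full grid for hard ones. easy_score in [0, 1] stands in
    for a learned difficulty estimate: higher means easier, fewer tokens."""
    grid = grids[min(int((1 - easy_score) * len(grids)), len(grids) - 1)]
    side = grid * patch
    # nearest-neighbour downsample to side x side, then tile into patches
    ys = np.arange(side) * image.shape[0] // side
    xs = np.arange(side) * image.shape[1] // side
    small = image[np.ix_(ys, xs)]
    return (small.reshape(grid, patch, grid, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(grid * grid, -1))

img = np.random.default_rng(0).standard_normal((224, 224))
print(tokens_for_image(img, easy_score=0.9).shape)  # (16, 256): a short "image"
```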

Transformer based trajectory prediction [article]

Aleksey Postnikov, Aleksander Gamayunov, Gonzalo Ferrer
2021 arXiv   pre-print
Motion prediction is an extremely challenging task which has recently gained significant attention from the research community.  ...  In this work, we present a simple and yet strong baseline for uncertainty-aware motion prediction based purely on transformer neural networks, which has shown its effectiveness in conditions of domain  ...  "An image is worth 16x16 words: Transformers for image recognition at scale". In: arXiv preprint arXiv:2010.11929 (2020). [6] Liangji Fang et al.  ... 
arXiv:2112.04350v1 fatcat:lz5swmmhgnaajbcl6c6s3te32i

Implicit Transformer Network for Screen Content Image Continuous Super-Resolution [article]

Jingyu Yang, Sheng Shen, Huanjing Yue, Kun Li
2021 arXiv   pre-print
For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer and an implicit position encoding  ...  However, image SR methods, mostly designed for natural images, do not generalize well to SCIs due to the very different image characteristics as well as the requirement of SCI browsing at arbitrary scales  ...  An image is worth 16x16 words: Transformers for image recognition at scale.  ... 
arXiv:2112.06174v1 fatcat:u5tywi75avh33gddbn2zsaw5cy
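
The "pixel values at query coordinates inferred from features at key coordinates" idea can be illustrated with plain coordinate-based attention. A minimal sketch: the paper's learned implicit transformer and implicit position encoding are replaced here by a fixed distance kernel, so this only conveys the query/key structure.

```python
import numpy as np

def query_pixel(query_xy, key_xy, key_feats, temperature=0.1):
    """Value at an arbitrary query coordinate = attention-weighted blend of
    features at key (pixel-grid) coordinates, weighted by distance."""
    d2 = ((key_xy - query_xy) ** 2).sum(axis=1)   # distance to each key
    w = np.exp(-d2 / temperature)
    w /= w.sum()
    return w @ key_feats                          # interpolated feature/value

keys = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
vals = np.array([0.0, 1.0, 1.0, 2.0])
print(query_pixel(np.array([0.25, 0.25]), keys, vals))  # near the 0.0 corner
```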

Audio-Visual Speech Recognition is Worth 32×32×8 Voxels [article]

Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
2021 arXiv   pre-print
Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks.  ...  On an AV-ASR task, the transformer front-end performs as well as (or better than) the convolutional baseline. Fine-tuning our model on the LRS3-TED training set matches previous state of the art.  ...  In a nutshell, the ViT model extracts non-overlapping 16x16 image patches, embeds them with a linear transform, and runs a transformer.  ... 
arXiv:2109.09536v1 fatcat:hhy76zxdyrdkpmdgasmreh5u7a
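
The title's 32×32×8 voxels are space-time patches: the 2D 16x16 tiling generalized to 3D blocks over the mouth-track video. A minimal NumPy sketch of that tokenization, with an illustrative clip size; the linear embedding and transformer that follow are omitted.

```python
import numpy as np

def video_to_voxel_tokens(video, px=32, py=32, pt=8):
    """Tile a T x H x W grayscale clip into non-overlapping pt x py x px
    space-time blocks ("voxels") and flatten each into one token."""
    t, h, w = video.shape
    assert t % pt == 0 and h % py == 0 and w % px == 0
    blocks = (video.reshape(t // pt, pt, h // py, py, w // px, px)
                   .transpose(0, 2, 4, 1, 3, 5)
                   .reshape(-1, pt * py * px))
    return blocks  # (num_voxels, 8*32*32) flattened space-time tokens

clip = np.zeros((32, 128, 128))  # illustrative T x H x W mouth crops
print(video_to_voxel_tokens(clip).shape)  # (64, 8192)
```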
Showing results 1 — 15 out of 239 results