
Conditional Positional Encodings for Vision Transformers [article]

Xiangxiang Chu and Zhi Tian and Bo Zhang and Xinlong Wang and Xiaolin Wei and Huaxia Xia and Chunhua Shen
2021 arXiv   pre-print
We propose a conditional positional encoding (CPE) scheme for vision Transformers.  ...  Built on PEG, we present the Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings.  ...  By doing so, Transformers can process input images of arbitrary size without bicubic interpolation or fine-tuning. • We demonstrate that positional encoding is crucial to vision transformers  ... 
arXiv:2102.10882v2 fatcat:uihyzgc44ndmjn3rbbiltf7avm
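The snippet above describes a Positional Encoding Generator (PEG) that conditions the encoding on the input, which is what lets the model handle arbitrary resolutions. A minimal numpy sketch of that idea, assuming (as in the paper's description) a zero-padded depthwise 3×3 convolution over the reshaped token grid; the function name and shapes are illustrative, not the authors' code:

```python
import numpy as np

def peg(tokens, h, w, kernel):
    """Conditional positional encoding via a PEG-style depthwise
    3x3 convolution over the h x w token grid, added back to the
    tokens.  The convolution is local and translation-equivariant,
    so the same weights apply at any input resolution."""
    n, c = tokens.shape
    assert n == h * w
    grid = tokens.reshape(h, w, c)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))  # zero padding
    out = np.zeros_like(grid)
    for i in range(3):
        for j in range(3):
            # depthwise: each channel has its own 3x3 filter
            out += padded[i:i + h, j:j + w, :] * kernel[i, j, :]
    return tokens + out.reshape(n, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((4 * 4, 8))    # 16 tokens, 8 channels
k = rng.standard_normal((3, 3, 8))     # one 3x3 filter per channel
y = peg(x, 4, 4, k)                    # same shape, now position-aware
z = peg(rng.standard_normal((5 * 6, 8)), 5, 6, k)  # new resolution, same weights
```

Because the encoding is computed from the tokens themselves rather than looked up in a fixed-length table, no interpolation of a learned embedding is needed when the resolution changes.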

Transforming Auto-Encoders [chapter]

Geoffrey E. Hinton, Alex Krizhevsky, Sida D. Wang
2011 Lecture Notes in Computer Science  
We show how neural networks can be used to learn features that output a whole vector of instantiation parameters and we argue that this is a much more promising way of dealing with variations in position  ...  By contrast, the computer vision community uses complicated, hand-engineered features, like SIFT [6], that produce a whole vector of outputs including an explicit representation of the pose of the feature  ...  and we do not need to know this visual entity or the origin of its coordinate frame in advance.  ... 
doi:10.1007/978-3-642-21735-7_6 fatcat:a7d7c64cozhpthh3evgkofqjo4

Implicit Transformer Network for Screen Content Image Continuous Super-Resolution [article]

Jingyu Yang, Sheng Shen, Huanjing Yue, Kun Li
2021 arXiv   pre-print
For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer and an implicit position encoding  ...  To this end, we propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for SCISR.  ...  Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882, 2021. [34] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.  ... 
arXiv:2112.06174v1 fatcat:u5tywi75avh33gddbn2zsaw5cy
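The ITSRN snippet above says pixel values at continuous query coordinates are inferred from features at key (grid) coordinates. A toy numpy sketch of that general idea, not the ITSRN architecture itself: attention weights decay with coordinate distance, so any continuous upscaling ratio can be queried (the function name and the distance-based weighting are illustrative assumptions):

```python
import numpy as np

def implicit_query(coords_q, coords_k, feats_k, temperature=0.1):
    """Infer values at arbitrary query coordinates from features at
    key (pixel-grid) coordinates via a softmax over negative squared
    coordinate distances, i.e. a coordinate-conditioned attention."""
    # (Q, K) squared distances between query and key coordinates
    d2 = ((coords_q[:, None, :] - coords_k[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ feats_k                            # (Q, C) interpolated features

grid = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # 2x2 pixel coords
feats = np.eye(4)                                          # one-hot features
q = np.array([[0.25, 0.25]])                               # sub-pixel query
out = implicit_query(q, grid, feats)
```

Querying a denser coordinate set with the same keys yields a larger output, which is the sense in which such a formulation supports continuous super-resolution ratios.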

Efficient Transformers: A Survey [article]

Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
2022 arXiv   pre-print
In the field of natural language processing, for example, Transformers have become an indispensable staple in the modern deep learning stack.  ...  Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning.  ...  We tried our best to incorporate most of the suggestions as we saw fit. We also thank Tamas Sarlos for feedback on this manuscript.  ... 
arXiv:2009.06732v3 fatcat:rxchuq3adrg3vlgn672pwd6evu

Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers [article]

Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang
2022 arXiv   pre-print
We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs  ...  In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control.  ...  Transformer encoder: We introduce the Transformer encoder to fuse the visual observations and the proprioceptive states for decision making.  ... 
arXiv:2107.03996v3 fatcat:l7tjgxb3prgv7p5vju4k5se2im

Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives [article]

Jun Li, Junyu Chen, Yucheng Tang, Bennett A. Landman, S. Kevin Zhou
2022 arXiv   pre-print
We offer a comprehensive review of the state-of-the-art Transformer-based approaches for medical imaging and exhibit current research progress made in the areas of medical image segmentation, recognition  ...  Since medical imaging bears some resemblance to computer vision, it is natural to inquire about the status quo of Transformers in medical imaging and ask the question: can the Transformer models transform  ...  In this paper, we highlight the properties of Vision Transformers and present a comparative review for Transformer-based medical image analysis.  ... 
arXiv:2206.01136v1 fatcat:krji4fb2ivfulbu2biqx2ihsfa

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [article]

Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and Nenghai Yu
2022 arXiv   pre-print
This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field.  ...  We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various  ...  Acknowledgement We thank many colleagues at Microsoft for their help and useful discussions, including Hangbo Bao, Li Dong and Furu Wei.  ... 
arXiv:2111.12710v2 fatcat:qx43dqs3gzav5ia2f5ppqwsk4m

Sensorimotor transformations in the worlds of frogs and robots

Michael A. Arbib, Jim-Shih Liaw
1995 Artificial Intelligence  
We generalize an analysis of the interaction of perceptual schemas in the VISIONS system for computer vision to a view of the interaction of perceptual and motor schemas in distributed planning which,  ...  we argue, has great promise for integrating mechanisms for action and perception in both animal and robot.  ...  The approach to schema-based interpretation in the VISIONS computer vision system [14] employs active, independent, schema instances, and the schemas encode mechanisms for using features in multiple  ... 
doi:10.1016/0004-3702(94)00055-6 fatcat:tnuagftutfg75hdoo3hxwjr7cu

Transformers for One-Shot Visual Imitation [article]

Sudeep Dasari, Abhinav Gupta
2020 arXiv   pre-print
For example, objects may be placed in different locations (e.g. kitchen layouts are different in every house).  ...  However, expanding these techniques to work with a single positive example during test time is still an open challenge.  ...  needed for robust control during test time.  ... 
arXiv:2011.05970v1 fatcat:mtnbrsyhknd6hnu66bvzbk6keu

[Re] Weakly-Supervised Semantic Segmentation via Transformer Explainability

Ioannis Athanasiadis, Georgios Moschovis, Alexander Tuoma, Sharath Chandra Raparthy, Koustuv Sinha
2022 Zenodo  
When it comes to replicating [1], the authors provided most of the information required to reproduce the vision-related experiments, with the code compensating for what was missing.  ...  We found it particularly easy to run and understand the code provided by the original authors of both the [1] and [2] papers.  ...  Furthermore, to actually normalize the CAMs, all we need to do is divide each of them by 2, which is what the normalization below would do, since R_u^(n)_j and R_v^(n)_j have identical sums.  ... 
doi:10.5281/zenodo.6574631 fatcat:fvti4isywzcxvnvsolx3hill2e

Transformer-based Conditional Variational Autoencoder for Controllable Story Generation [article]

Le Fang, Tao Zeng, Chaochun Liu, Liefeng Bo, Wen Dong, Changyou Chen
2021 arXiv   pre-print
We investigate large-scale latent variable models (LVMs) for neural story generation -- an under-explored application for open-domain long text -- with objectives in two threads: generation effectiveness  ...  Recently, Transformers and their variants have achieved remarkable effectiveness without explicit latent representation learning, and thus lack satisfactory controllability in generation.  ...  for positive.  ... 
arXiv:2101.00828v2 fatcat:fra5lvnefres5jb4ztqm7mkhle

Mask Transformer: Unpaired Text Style Transfer Based on Masked Language

Chunhua Wu, Xiaolong Chen, Xingbiao Li
2020 Applied Sciences  
We propose a "Mask and Generation" structure, which can obtain an explicit representation of the content of the original sentence and generate the target sentence with a transformer.  ...  As the explicit representation is readable and the model has better interpretability, we can clearly know which words changed and why they changed.  ...  Therefore, some researchers have proposed methods that do not need to separate style and content. Reference [7] was inspired by the cycle style transfer method [17] in Computer Vision.  ... 
doi:10.3390/app10186196 fatcat:ctc37jka2vb7pir7ezsfry65de

Zero-Shot Controlled Generation with Encoder-Decoder Transformers [article]

Devamanyu Hazarika, Mahdi Namazifar, Dilek Hakkani-Tür
2022 arXiv   pre-print
In this work, we propose novel approaches for controlling encoder-decoder transformer-based NLG models in a zero-shot fashion.  ...  We also study how this hypothesis could lead to more efficient ways of training encoder-decoder transformer models.  ...  Acknowledgements We are truly grateful for valuable feedback from Gokhan Tur, Yang Liu, Behnam Hedayatnia, Nicole Chartier, and Mohit Bansal.  ... 
arXiv:2106.06411v3 fatcat:4hlkat6zinar5egyg4g4pauqga

Transforming vision into action

Melvyn A. Goodale
2011 Vision Research  
A new model of the functional organization of the visual pathways in the primate cerebral cortex has emerged, one that posits a division of labor between vision-for-action (the dorsal stream) and vision-for-perception  ...  Except for the study of eye movements, which have been regarded as an information-seeking adjunct to visual perception, little attention was paid to the way in which vision is used to control our actions  ...  We may need to recognize objects we have seen minutes, hours, days -or even years before.  ... 
doi:10.1016/j.visres.2010.07.027 pmid:20691202 fatcat:6j4htd5wyfatjk7c3e6vf2odmq

ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers [article]

Jonáš Kulhánek and Erik Derner and Torsten Sattler and Robert Babuška
2022 arXiv   pre-print
To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation.  ...  Our model uses a two-stage architecture consisting of a codebook and a transformer model.  ...  Acknowledgements This work was supported by the European Regional Development Fund under projects Robotics for Industry 4.0 (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000470) and IMPACT (reg. no.  ... 
arXiv:2203.10157v1 fatcat:hlg53r5jareh3b5hujttugdt54