
Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency [article]

Viraj Prabhu, Sriram Yenamandra, Aaditya Singh, Judy Hoffman
2022 arXiv   pre-print
In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- and modern pretraining based on self-supervised learning (SSL  ...  PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated  ...  This work was supported in part by funding from the DARPA LwLL project and ARL.  ... 
arXiv:2206.08222v1 fatcat:ixb55x3dbrbpnnxwvkubjjwb6e

iBOT: Image BERT Pre-Training with Online Tokenizer [article]

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
2022 arXiv   pre-print
Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics.  ...  The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces.  ...  Acknowledgement Tao Kong is the corresponding author. We would like to acknowledge Feng Wang, Rufeng Zhang, and Zongwei Zhou for helpful discussions.  ... 
arXiv:2111.07832v3 fatcat:rojdktyjmveapdubvxcgdcfgn4

Unsupervised Face Normalization With Extreme Pose and Expression in the Wild

Yichen Qian, Weihong Deng, Jiani Hu
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
Face normalization provides an effective and cheap way to distill face identity and dispel face variances for recognition. We focus on face generation in the wild with unpaired data.  ...  Extensive qualitative and quantitative experiments on both controlled and in-the-wild databases demonstrate the superiority of our face normalization method. Code is available at  ...  [10], its surprising performance on generative tasks has drawn substantial attention from the deep learning and computer vision community.  ... 
doi:10.1109/cvpr.2019.01008 dblp:conf/cvpr/QianDH19 fatcat:j2djoouupvgyhnjcuawvmyvk5a

Efficient Self-supervised Vision Transformers for Representation Learning [article]

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
2022 arXiv   pre-print
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning.  ...  Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput.  ...  (iii) Convolution vision Transformer (CvT) : Features in neighboring windows are considered in the convolutional projection in self-attentions. The window size is set to W = 7 by default.  ... 
arXiv:2106.09785v2 fatcat:ermnlfxkp5c2vm6mnrbfrf43d4

Moving Towards Centers: Re-ranking with Attention and Memory for Re-identification [article]

Yunhao Zhou, Yi Wang, Lap-Pui Chau
Then, we distill and refine the probe-related features into the Contextual Memory cell via attention mechanism.  ...  For correlation prediction, we first aggregate the contextual information for probe's k-nearest neighbors via the Transformer encoder.  ...  Second, a Contextual Memory initialized by attention mechanism distills the probe-related contextual features.  ... 
doi:10.48550/arxiv.2105.01447 fatcat:2gzsokl6pjhxdkvunyorj6yrla

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [article]

Lin Wang, Kuk-Jin Yoon
2021 arXiv   pre-print
Additionally, we systematically analyze the research status of KD in vision applications.  ...  To achieve faster speeds and to handle the problems caused by the lack of data, knowledge distillation (KD) has been proposed to transfer information learned from one model to another.  ...  In such a setting, the transformation of the student's guided layer is done by a self-attention transformer. Chung et al.  ... 
arXiv:2004.05937v6 fatcat:yqzo7nylzbbn7pfhzpfc2qaxea

Recognizing Families through Images with Pretrained Encoder [article]

Tuan-Duy H. Nguyen, Huu-Nghia H. Nguyen, Hieu Dao
2020 arXiv   pre-print
retrieval in the Recognizing Family in The Wild 2020 competition.  ...  Kinship verification and kinship retrieval are emerging tasks in computer vision.  ...  INTRODUCTION Over the last few years, the application of computer vision in kinship verification has been gaining attention with many benchmark datasets released by several research groups such as KinshipW  ... 
arXiv:2005.11811v1 fatcat:jpga2x6znney7c7henrixqkjza

Masked Autoencoders Are Scalable Vision Learners [article]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
2021 arXiv   pre-print
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.  ...  Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.  ...  Training data-efficient image transformers & distillation through attention. In ICML, 2021. [54] Hugo Touvron, Alexandre Sablayrolles, Matthijs Douze, Matthieu Cord, and Hervé Jégou.  ... 
arXiv:2111.06377v3 fatcat:4d7762easfdcniz4jvqedqizqy

OODformer: Out-Of-Distribution Detection Transformer [article]

Rajat Koner, Poulami Sinhamahapatra, Karsten Roscher, Stephan Günnemann, Volker Tresp
2021 arXiv   pre-print
This paper proposes a first-of-its-kind OOD detection architecture named OODformer that leverages the contextualization capabilities of the transformer.  ...  Incorporating the transformer as the principal feature extractor allows us to exploit the object concepts and their discriminate attributes along with their co-occurrence via visual attention.  ...  Describing textures in the wild.  ... 
arXiv:2107.08976v2 fatcat:o23etlt6cbfmtmx52hi2ptodna

Vision Transformers are Robust Learners [article]

Sayak Paul, Pin-Yu Chen
2021 arXiv   pre-print
In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.  ...  Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer  ...  Acknowledgements We are thankful to the Google Developers Experts program (specifically Soonson Kwon and Karl Weinmeister) for providing Google Cloud Platform credits to support the experiments.  ... 
arXiv:2105.07581v3 fatcat:ngw5cdn2mbcwdcsoqzo6kfas24

On Improving the Generalization of Face Recognition in the Presence of Occlusions

Xiang Xu, Nikolaos Sarafianos, Ioannis A. Kakadiaris
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)  
First, an attention mechanism was proposed that extracted local identity-related region. The local features were then aggregated with the global representations to form a single template.  ...  2% in terms of rank-1 accuracy in an image-set-based scenario.  ...  In the top-down pathway, since the global features include information from the occluded region of the face, an attention module G_A is proposed that distills the identity-related features from the global  ... 
doi:10.1109/cvprw50498.2020.00407 dblp:conf/cvpr/0005SK20 fatcat:uitlnvmjdzc3lmyyhc7g4uemki

Advances and Challenges in Deep Lip Reading [article]

Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, Mohammad Akbari
2021 arXiv   pre-print
Advancements in these directions will expedite the transformation of silent speech interface from theory to practice. We also discuss the main modules of a VSR pipeline and the influential datasets.  ...  Driven by deep learning techniques and large-scale datasets, recent years have witnessed a paradigm shift in automatic lip reading.  ...  Knowledge Distillation in Lip Reading.  ... 
arXiv:2110.07879v1 fatcat:eimcuzdz5va3vdlgw2g7y25tki

VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training

Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang
2022 International Conference on Machine Learning  
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks.  ...  We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and are practical in terms of  ...  On top of each encoder, there is a 6-layer transformer to model the cross-modal interaction based on a cross-attention block.  ... 
dblp:conf/icml/ZhouZDZ22 fatcat:bedb3323rvg6dkahxquf37upz4

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models [article]

Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang
2022 arXiv   pre-print
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks.  ...  We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and are practical in terms of efficiency-performance  ...  On top of each encoder, there is a 6-layer transformer to model the cross-modal interaction based on a cross-attention block.  ... 
arXiv:2205.15237v1 fatcat:2ytmn43x3zanjg4nmm5ahyqbby

Person search: New paradigm of person re-identification: A survey and outlook of recent works

Khawar Islam
2020 Image and Vision Computing  
Person Search (PS) has become a major field because of its need in community and in the field of research among researchers.  ...  In the last few years, deep learning has played a remarkable role in solving the re-identification problem. Deep learning shows incredible performance in person (re-ID) and search.  ...  The persons appeared in different cameras, resulting, labeled 8432 identities. PRW (Person-Re-identification in the Wild) is an extended version of Market 1501 dataset.  ... 
doi:10.1016/j.imavis.2020.103970 fatcat:g2zuqww7tbdszkxrc2wkrfno2y