
Are Pretrained Convolutions Better than Pretrained Transformers?

Yi Tay, Mostafa Dehghani, Jai Prakash Gupta, Vamsi Aribandi, Dara Bahri, Zhen Qin, Donald Metzler
2021 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings.  ...  In the era of pre-trained language models, Transformers are the de facto choice of model architectures.  ...  • RQ4: What are the failure modes, caveats and reasons to not use pre-trained convolutions? • RQ5: Are certain convolution variants better than others?  ... 
doi:10.18653/v1/2021.acl-long.335 fatcat:nvxvwxbcqvhw3hmgs6gniu5ew4

AST: Audio Spectrogram Transformer [article]

Yuan Gong, Yu-An Chung, James Glass
2021 arXiv   pre-print
In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.  ...  To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model.  ...  These Transformer models are usually referred to as convolution-free to distinguish them from CNNs [11, 12].  ... 
arXiv:2104.01778v3 fatcat:ufm2rlzvtbfuxgjammljwuavbi
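
The AST entry above describes a convolution-free classifier that applies pure self-attention to spectrogram patches. The sketch below illustrates that general pattern in PyTorch; the patch size, embedding width, layer count, and mean-pooling are illustrative placeholders (positional embeddings are omitted), not the published AST configuration.

```python
import torch
import torch.nn as nn

class PatchSpectrogramClassifier(nn.Module):
    """Toy convolution-free audio classifier: spectrogram patches -> transformer -> label."""
    def __init__(self, patch_size=16, embed_dim=192, num_layers=4, num_classes=50):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size, embed_dim)   # flatten and embed each patch
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, spec):   # spec: (batch, freq, time), both divisible by patch_size
        b, p = spec.shape[0], self.patch_size
        # cut the spectrogram into non-overlapping p x p patches and flatten each one
        patches = spec.unfold(1, p, p).unfold(2, p, p).reshape(b, -1, p * p)
        tokens = self.encoder(self.proj(patches))   # self-attention over all patches
        return self.head(tokens.mean(dim=1))        # mean-pool tokens, then classify

logits = PatchSpectrogramClassifier()(torch.randn(2, 128, 256))   # 128 mel bins, 256 frames
print(logits.shape)   # torch.Size([2, 50])
```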

Molecule Attention Transformer [article]

Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, Stanisław Jastrzębski
2020 arXiv   pre-print
Finally, we show that attention weights learned by MAT are interpretable from the chemical point of view.  ...  To move towards this goal, we propose Molecule Attention Transformer (MAT).  ...  We expect that MAT will work significantly better than a vanilla graph convolutional network if λ_d is tuned well.  ... 
arXiv:2002.08264v1 fatcat:ollmpvwd7fan3ayokytnv2yih4
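
The MAT snippet refers to a distance weight λ_d: the idea is to mix standard self-attention with terms derived from inter-atomic distances and the bond adjacency matrix. Below is a rough, hypothetical sketch of that mixing; the mixing weights, the distance kernel, and the toy adjacency matrix are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def molecule_attention(q, k, v, dist, adj, lam_att=0.3, lam_dist=0.3, lam_adj=0.4):
    """Sketch of structure-augmented attention: the attention matrix is a weighted mix of
    softmax attention, a distance-derived term, and the bond adjacency matrix."""
    d = q.shape[-1]
    att = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)   # (n_atoms, n_atoms)
    dist_term = F.softmax(-dist, dim=-1)                        # closer atoms get more weight
    mix = lam_att * att + lam_dist * dist_term + lam_adj * adj
    return mix @ v

n, dim = 5, 16
q = k = v = torch.randn(n, dim)
dist = torch.rand(n, n) * 10     # toy pairwise inter-atomic distances
adj = torch.eye(n)               # toy adjacency; a real molecule would use its bond graph
print(molecule_attention(q, k, v, dist, adj).shape)   # torch.Size([5, 16])
```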

Toward Transformer-Based Object Detection [article]

Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
2020 arXiv   pre-print
This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification.  ...  The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures  ...  ViT-FRCNN yields better out-of-domain performance than ResNet-FRCNN approaches.  ... 
arXiv:2012.09958v1 fatcat:rhdftpryhfbrdo6npq7e4vfuse

Convolutional Bypasses Are Better Vision Transformer Adapters [article]

Shibo Jie, Zhi-Hong Deng
2022 arXiv   pre-print
ViT and only finetune these modules while the pretrained weights are frozen.  ...  In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small amount (less than 0.5% of model parameters) of trainable parameters to adapt  ...  First, the parallel designs are better than their sequential counterparts.  ... 
arXiv:2207.07039v3 fatcat:7l2iqgb35bbv5nb7v34fbkrira
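
The Convpass entry describes adding small trainable convolutional bypass modules to a frozen pretrained ViT. The sketch below shows one plausible shape for such a bypass: project the patch tokens to a narrow width, apply a 3x3 convolution over the token grid, and project back, while the pretrained weights stay frozen. The bottleneck width, grid size, and placement are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ConvBypass(nn.Module):
    """Illustrative adapter: down-project tokens, run a 3x3 conv over the token grid,
    up-project. Assumes a square grid of patch tokens (no CLS token handling)."""
    def __init__(self, dim=768, hidden=8, grid=14):
        super().__init__()
        self.grid = grid
        self.down = nn.Linear(dim, hidden)
        self.conv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: (batch, grid*grid, dim)
        b = x.shape[0]
        h = self.down(x)                                            # (b, n, hidden)
        h = h.transpose(1, 2).reshape(b, -1, self.grid, self.grid)  # tokens -> 2D grid
        h = self.conv(h)                                            # local spatial mixing
        h = h.flatten(2).transpose(1, 2)                            # grid -> tokens
        return self.up(h)                                           # bypass branch output

bypass = ConvBypass()                          # only these ~13K parameters would be trained;
tokens = torch.randn(2, 14 * 14, 768)          # the pretrained ViT itself stays frozen
print((tokens + bypass(tokens)).shape)         # residual addition: torch.Size([2, 196, 768])
```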

TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining [article]

Viet-Anh Nguyen, Anh H. T. Nguyen, Andy W. H. Khong
2022 arXiv   pre-print
Pretraining and filter augmentation also help stabilize and enhance the overall performance.  ...  The proposed architecture simplifies the UNet backbone of the TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation.  ...  Compared to the masked reconstruction pretraining in [17], both encoder and decoder are pretrained in our proposed approach.  ... 
arXiv:2110.13492v4 fatcat:43k4agveffcwzp2pn3rtwoo6fi

Convolutions are competitive with transformers for protein sequence pretraining [article]

Kevin K. Yang, Alex X. Lu, Nicolo Fusi
2022 bioRxiv   pre-print
CNNs are competitive on the pretraining task with transformers across several orders of magnitude in parameter size while scaling linearly with sequence length.  ...  We investigate the potential of a convolution-based architecture for protein sequence masked language model pretraining and subsequent finetuning.  ...  ., 2015], 1.1 million, or 2.6%, are longer than 1022 residues.  ... 
doi:10.1101/2022.05.19.492714 fatcat:syvltnsfuvd4djnqabdmfr2m2a
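
This entry concerns masked language model pretraining with a convolutional encoder whose cost scales linearly with sequence length. The toy example below shows the generic recipe: corrupt random positions with a mask token, encode with dilated 1D convolutions, and compute cross-entropy only at the masked positions. The vocabulary size, masking rate, and encoder are illustrative, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNMaskedLM(nn.Module):
    """Toy masked-token pretraining with a 1D CNN encoder (cost linear in sequence length)."""
    def __init__(self, vocab=25, dim=128, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=5, padding=2 * 2**i, dilation=2**i)
            for i in range(layers)])                     # dilations grow the receptive field
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):                           # tokens: (batch, length)
        h = self.embed(tokens).transpose(1, 2)           # (batch, dim, length)
        for conv in self.convs:
            h = h + torch.relu(conv(h))                  # residual dilated convolutions
        return self.out(h.transpose(1, 2))               # (batch, length, vocab)

def masked_lm_step(model, seqs, mask_rate=0.15, mask_id=24):
    mask = torch.rand_like(seqs, dtype=torch.float) < mask_rate
    corrupted = seqs.masked_fill(mask, mask_id)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], seqs[mask])     # loss only at masked positions

loss = masked_lm_step(CNNMaskedLM(), torch.randint(0, 24, (4, 512)))
print(loss.item())
```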

Are Pre-trained Convolutions Better than Pre-trained Transformers? [article]

Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler
2022 arXiv   pre-print
In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings.  ...  In the era of pre-trained language models, Transformers are the de facto choice of model architectures.  ...  We show that convolutions are not only consistently faster (even at shorter sequences) but scale better than transformers.  ... 
arXiv:2105.03322v2 fatcat:uoaeqeky6zcbdhbq6rlmpohhle
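
The speed and scaling claim above reflects the fact that a convolutional token mixer costs O(n·k) in sequence length n and kernel size k, whereas self-attention costs O(n²). The paper studies lightweight, dynamic, and dilated convolutions; the depthwise-separable block below is only a generic stand-in used to illustrate the linear-cost structure, not one of the paper's variants.

```python
import torch
import torch.nn as nn

class DepthwiseConvMixer(nn.Module):
    """Generic convolutional token mixer: per position, cost depends on the kernel size,
    not on how long the sequence is (unlike quadratic self-attention)."""
    def __init__(self, dim=256, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, length, dim)
        h = x.transpose(1, 2)                  # Conv1d expects (batch, dim, length)
        h = self.pointwise(torch.relu(self.depthwise(h)))
        return x + h.transpose(1, 2)           # residual connection

mixer = DepthwiseConvMixer()
print(mixer(torch.randn(2, 1024, 256)).shape)  # torch.Size([2, 1024, 256])
```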

SSAST: Self-Supervised Audio Spectrogram Transformer [article]

Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass
2022 arXiv   pre-print
The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST  ...  Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs)  ...  and audio datasets leads to better performance than using data from a single domain.  ... 
arXiv:2110.09784v2 fatcat:z3rz7pigjrbkvejzs577imc7ky

SSAST: Self-Supervised Audio Spectrogram Transformer

Yuan Gong, Cheng-I Lai, Yu-An Chung, James Glass
2022 Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)
The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST  ...  Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs)  ...  and audio datasets leads to better performance than using data from a single domain.  ... 
doi:10.1609/aaai.v36i10.21315 fatcat:behbtt4dj5fw3cyuyerm3vqsje

BEVT: BERT Pretraining of Video Transformers [article]

Rui Wang and Dongdong Chen and Zuxuan Wu and Yinpeng Chen and Xiyang Dai and Mengchen Liu and Yu-Gang Jiang and Luowei Zhou and Lu Yuan
2022 arXiv   pre-print
This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success from BERT pretraining of image transformers.  ...  This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often times computationally-intensive  ...  The results are shown in Table 7. We see that images from ImageNet are slightly better than those from K400, i.e. less than 0.3% on all three datasets.  ... 
arXiv:2112.01529v3 fatcat:iushdrphkzdffpmpbewhlwx4qu

SiT: Self-supervised vIsion Transformer [article]

Sara Atito and Muhammad Awais and Josef Kittler
2021 arXiv   pre-print
In natural language processing (NLP) self-supervised learning and transformers are already the methods of choice.  ...  The recent literature suggests that the transformers are becoming increasingly popular also in computer vision.  ...  • The performance achieved is significantly better than state-of-the-art self-supervised methods.  ... 
arXiv:2104.03602v2 fatcat:leyl2xvhsnbflnwohrgigxsogy

TinyViT: Fast Pretraining Distillation for Small Vision Transformers [article]

Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan
2022 arXiv   pre-print
Moreover, with increased image resolution, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% of the parameters.  ...  The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT.  ...  Under the same training recipe, our TinyViT architecture achieves better performance than Swin-T, getting 1.5% AP improvements.  ... 
arXiv:2207.10666v1 fatcat:3xkyjmockvhmbltkj3jhsur2sa
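
TinyViT's pretraining distillation transfers knowledge from a large pretrained teacher to a small student via soft labels. The function below is the standard temperature-scaled distillation loss, shown only as a reference point; TinyViT's full recipe, which applies this during large-scale pretraining with precomputed teacher logits, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label knowledge distillation: the student matches the teacher's tempered
    output distribution (KL divergence, rescaled by T^2 to keep gradient magnitudes)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

student = torch.randn(8, 1000)   # logits from the small student transformer
teacher = torch.randn(8, 1000)   # logits from the large pretrained teacher
print(distillation_loss(student, teacher).item())
```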

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [article]

Farrukh Rahman, Ömer Mubarek, Zsolt Kira
2022 arXiv   pre-print
Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.  ...  using the predominant features of video transformer architectures.  ...  Our results indicate that Video Transformers (VTs) are better learners in low-labeled video settings than CNNs.  ... 
arXiv:2209.07474v1 fatcat:zs5ukspb4jcz7hi3tiztf6pwtq

Vision Transformers for Dense Prediction [article]

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
2021 arXiv   pre-print
Our models are available at https://github.com/intel-isl/DPT.  ...  We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.  ...  We observe more details and also better global depth arrangement in DPT predictions when compared to the fully convolutional baseline.  ... 
arXiv:2103.13413v1 fatcat:nkanptalqreifakf4uf52awk4y
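
DPT uses a vision transformer backbone for dense prediction by reassembling tokens into image-like feature maps that a convolutional decoder refines. The sketch below shows only that token-to-feature-map step followed by a trivial decoder; the grid size, channel widths, and single-scale upsampling are placeholder assumptions, far simpler than the multi-scale reassembly the paper describes.

```python
import torch
import torch.nn as nn

class TokensToDenseHead(nn.Module):
    """Sketch: reshape ViT patch tokens into a 2D feature map and decode per-pixel output."""
    def __init__(self, dim=768, grid=24, out_channels=1, scale=16):
        super().__init__()
        self.grid = grid
        self.decode = nn.Sequential(
            nn.Conv2d(dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
            nn.Conv2d(256, out_channels, kernel_size=1),
        )

    def forward(self, tokens):                           # tokens: (batch, grid*grid, dim)
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)  # tokens -> grid
        return self.decode(fmap)                         # (batch, out_channels, H, W)

head = TokensToDenseHead()
print(head(torch.randn(1, 24 * 24, 768)).shape)          # torch.Size([1, 1, 384, 384])
```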