
Masked Feature Prediction for Self-Supervised Visual Pre-Training [article]

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer
2021 arXiv   pre-print
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models.  ...  Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions.  ...  We provide more qualitative results of image HOG predictions in Fig. 4 using ImageNet-1K validation images and for video HOG predictions in Fig. 5 using Kinetics-400 validation videos. D.  ... 
arXiv:2112.09133v1 fatcat:lln3q37gpjefhmctt7u6kiarl4
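The recipe this abstract describes — mask random patches, then regress the HOG features of the masked regions — can be sketched in NumPy. This is an illustrative toy, not the paper's implementation: the "HOG" here is simplified to a single per-patch gradient-orientation histogram, and all function and parameter names are made up for the sketch.

```python
import numpy as np

def hog_target(patch, n_bins=9):
    # Simplified stand-in for the HOG descriptor MaskFeat regresses:
    # a magnitude-weighted histogram of unsigned gradient orientations.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-8)

def maskfeat_loss(patches, predictions, mask_ratio=0.4, rng=None):
    # Randomly mask a fraction of patches and score the model's predictions
    # of the masked patches' feature targets with an L2 loss.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(patches)
    masked = rng.choice(n, size=max(1, int(mask_ratio * n)), replace=False)
    targets = np.stack([hog_target(patches[i]) for i in masked])
    preds = predictions[masked]
    return float(np.mean((preds - targets) ** 2))
```

Because the target is a hand-crafted feature rather than pixels or discrete tokens, no tokenizer pre-training is needed — one appeal of the MaskFeat formulation.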

MVP: Multimodality-guided Visual Pre-training [article]

Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian
2022 arXiv   pre-print
Recently, masked image modeling (MIM) has become a promising direction for visual pre-training.  ...  In the context of vision transformers, MIM learns effective visual representation by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus).  ...  Masked Image Modeling with Tokenizer As presented in Eqn (1), the core of self-supervised visual pre-training is to design a proper pretext task.  ... 
arXiv:2203.05175v1 fatcat:jam2kmshsrfopprwwqxysvcf2i

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [article]

Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan
2022 arXiv   pre-print
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning.  ...  Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors of the discrete token ids, which are predicted by the off-the-shelf image tokenizer and  ...  Self-supervised visual pre-training In the past few years, various pretext tasks have been designed for self-supervised visual pre-training.  ... 
arXiv:2203.15371v2 fatcat:kype74wae5esbfg2oc32wu7pxa
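The "multi-choice" supervision in this snippet — training against the tokenizer's soft probability vectors rather than a single hard token id — amounts to a soft-target cross-entropy at the masked positions. A minimal NumPy sketch, with illustrative names not taken from the paper:

```python
import numpy as np

def soft_token_loss(logits, soft_targets, mask):
    # Cross-entropy between the model's predicted token distribution and the
    # tokenizer's soft probability vector, computed only at masked patches.
    z = logits[mask] - logits[mask].max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(soft_targets[mask] * log_probs).sum(axis=1).mean())
```

With a one-hot `soft_targets` this reduces to the standard BEiT-style hard token prediction, which is the contrast the paper draws.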

MST: Masked Self-Supervised Transformer for Visual Representation [article]

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
2021 arXiv   pre-print
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success.  ...  However, it has not been fully explored in visual self-supervised learning.  ...  After self-supervised pre-training, we remove the MLP heads and train a supervised linear classifier on frozen features.  ... 
arXiv:2106.05656v2 fatcat:gnlxfm5a7veupgq4oex5s3ph3i

MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [article]

Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li
2021 arXiv   pre-print
In this paper, we propose a pre-training model MEmoBERT for multimodal emotion recognition, which learns multimodal joint representations through self-supervised learning from large-scale unlabeled video  ...  Furthermore, unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction one, bringing  ...  We design four efficient self-supervised pre-training tasks to learn joint multimodal emotional representations, including Whole Word Masked Language Modeling (WWMLM), Span Masked Visual Frame Modeling  ... 
arXiv:2111.00865v1 fatcat:pzlft4ufwzgplb6gceehaz4sxm

Self-Supervised learning with cross-modal transformers for emotion recognition [article]

Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
2020 arXiv   pre-print
We learn multi-modal representations using a transformer trained on the masked language modeling task with audio, visual and text features.  ...  In this work, we extend self-supervised training to multi-modal applications.  ...  In Section 2 we describe our model architecture and the self-supervised approach for pre-training, along with further motivation for the self-supervised learning we choose.  ... 
arXiv:2011.10652v1 fatcat:lkpk5tdlzbhulghtudxzgxtqni

An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers [article]

Gokul Karthik Kumar, Sahal Shaji Mullappilly, Abhishek Singh Gehlot
2022 arXiv   pre-print
However, the CNN feature maps still maintain the spatial relationship and we utilize this property to design self-supervised learning approaches to train the encoder of object detection transformers in  ...  Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks.  ...  The objective of the pre-training task is to predict the visual tokens of the original image based on the encoding vectors of the masked image.  ... 
arXiv:2205.05543v1 fatcat:tquw65c7bbdvhhdtr5q545hohq

Crystal Twins: Self-supervised Learning for Crystalline Material Property Prediction [article]

Rishikesh Magar, Yuyang Wang, Amir Barati Farimani
2022 arXiv   pre-print
By sharing the pre-trained weights when fine-tuning the GNN for regression tasks, we significantly improve the performance for 7 challenging material property prediction benchmarks  ...  Recent advances in Self-Supervised Learning (SSL) frameworks capable of training ML models on unlabeled data have mitigated this problem and demonstrated superior performance in computer vision and natural  ...  The authors would like to thank Prakarsh Yadav and Alison Bartsch for their comments on the manuscript.  ... 
arXiv:2205.01893v1 fatcat:buwqqf74yrchhgvfbymumeywrm

Dense Contrastive Visual-Linguistic Pretraining [article]

Lei Shi, Kai Shuang, Shijie Geng, Peng Gao, Zuohui Fu, Gerard de Melo, Yunpeng Chen, Sen Su
2021 arXiv   pre-print
Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations.  ...  In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks.  ...  Inspired by this, Visual-Linguistic Pretraining (VLP) has been proposed to learn multimodal models covering both vision and language, by adding extra masked prediction self-supervised strategies for the  ... 
arXiv:2109.11778v1 fatcat:sown4wcp45c5dpizfyrcrnsks4

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [article]

Yucheng Zhao, Guangting Wang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha
2021 arXiv   pre-print
In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea.  ...  Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task.  ...  Comparison of MaskCo to previous supervised/self-supervised pre-training methods on ImageNet pre-training datasets.  ... 
arXiv:2108.07954v1 fatcat:4jxz57sn6bbsjcikknkqilq3uu
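The contrastive mask prediction (CMP) idea above — an embedding of a masked region should match its true content over distractors — is naturally scored with an InfoNCE-style objective. The sketch below is a generic InfoNCE in NumPy under assumed names and hyper-parameters, not MaskCo's actual architecture:

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    # The query (e.g., a masked-region embedding) should have higher
    # similarity to its positive (the real content) than to negatives.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    logits = np.array([cos(query, positive)] + [cos(query, n) for n in negatives])
    logits = logits / temperature
    z = logits - logits.max()               # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    return float(-np.log(probs[0]))         # cross-entropy with the positive at index 0
```

The loss is near zero when the query already matches its positive and grows as distractors become more similar than the positive.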

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language [article]

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
2022 arXiv   pre-print
To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision.  ...  The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture.  ...  Acknowledgements We thank Brenden Lake, Dhruv Batra and Marco Baroni for helpful discussions. We also thank Laurens van der Maaten for feedback on an earlier version of the paper.  ... 
arXiv:2202.03555v2 fatcat:vvcfmb46vvgglialffrynmym24
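The core idea quoted above — predict latent representations of the full input from a masked view, in a self-distillation setup — can be sketched with a toy linear "encoder" and an EMA teacher. All shapes, names, and the linear encoder itself are illustrative assumptions, not data2vec's actual Transformer setup:

```python
import numpy as np

def ema_update(teacher, student, tau=0.999):
    # Self-distillation: the teacher's weights track the student's via an
    # exponential moving average instead of receiving gradients.
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}

def data2vec_step(x, mask, student_enc, teacher_enc):
    # The teacher encodes the full input to produce latent targets; the
    # student sees a masked view and regresses those targets at the
    # masked positions.
    targets = x @ teacher_enc                    # latents from the unmasked input
    x_masked = np.where(mask[:, None], 0.0, x)   # zero out masked timesteps/patches
    preds = x_masked @ student_enc
    return float(np.mean((preds[mask] - targets[mask]) ** 2))
```

Because the targets are continuous latents rather than modality-specific tokens, words, or pixels, the same loss applies unchanged to speech, vision, and text — the framework's central claim.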

Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [article]

Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim, Inho Kang, Nojun Kwak
2020 arXiv   pre-print
In this paper, we propose novel training schemes for multiple-choice video question answering with a self-supervised pre-training stage and a supervised contrastive learning in the main stage as an auxiliary  ...  In the self-supervised pre-training stage, we transform the original problem format of predicting the correct answer into the one that predicts the relevant question to provide a model with broader contextual  ...  We set the learning rate to 1e-5 for the self-supervised pre-training stage and 5e-5 for the main QA stage.  ... 
arXiv:2009.08043v2 fatcat:hbnzmwqmknb5hatjhbdvmptazu

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [article]

Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, Kai-Wei Chang
2021 arXiv   pre-print
Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.  ...  In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge  ...  amount of aligned text-image pairs for "mask-and-predict" pre-training.  ... 
arXiv:2010.12831v2 fatcat:ftyzelmc35dg3fwckci4kh5we4

Contrastive Visual-Linguistic Pretraining [article]

Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, Sen Su
2020 arXiv   pre-print
To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning.  ...  However, as ViLBERT and LXMERT adopt visual region regression and classification loss, they often suffer from domain gap and noisy label problems, based on the visual features having been pretrained on  ...  The two prominent VLP methods LXMERT [4] and ViLBERT [5] usually perform feature regression or classification for masked visual regions as the pretext task of self-supervised learning.  ... 
arXiv:2007.13135v1 fatcat:z2p3beypund2raq765k2trclq4

A Survey of Visual Transformers [article]

Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He
2022 arXiv   pre-print
Because of their differences in training settings and dedicated vision tasks, we have also evaluated and compared all these existing visual Transformers under different configurations.  ...  Finally, three promising research directions are suggested for future investment.  ...  Improving set prediction with other label assignments and losses may be helpful for new detection frameworks. 2) Self-Supervised Learning: Self-supervised pre-training of Transformers has standardized  ... 
arXiv:2111.06091v3 fatcat:a3fq6lvvzzgglb3qtus5qwrwpe
Showing results 1 — 15 out of 19,625 results