87,532 Hits in 4.0 sec

Learning to Prompt for Vision-Language Models [article]

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
2022 arXiv   pre-print
Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language  ...  Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks.  ...  (CoOp) to automate prompt engineering, specifically for pre-trained vision-language models.  ... 
arXiv:2109.01134v4 fatcat:sz24avig5bhcpfkmmov6a64ofm
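
The CoOp snippet above describes replacing hand-written prompt words with context vectors learned from downstream data while the pre-trained model stays frozen. Below is a minimal, hypothetical PyTorch sketch of that idea; the `PromptLearner` name, the dimensions, and the toy usage are illustrative assumptions, and the real method feeds the concatenated prompt through CLIP's frozen text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, n_ctx, dim, class_embeddings):
        super().__init__()
        # learnable context vectors, shared across all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # frozen per-class name embeddings: (n_classes, n_name_tokens, dim)
        self.register_buffer("cls_emb", class_embeddings)

    def forward(self):
        n_cls = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # (n_cls, n_ctx, dim)
        # prompt layout "[V]_1 ... [V]_M CLASS", to be fed to a frozen text encoder
        return torch.cat([ctx, self.cls_emb], dim=1)

def clip_style_logits(image_feats, text_feats, temperature=0.01):
    # cosine-similarity classification, as in CLIP-style zero-shot inference
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature

learner = PromptLearner(n_ctx=4, dim=8, class_embeddings=torch.randn(3, 2, 8))
prompts = learner()   # shape (3, 4 + 2, 8); only learner.ctx receives gradients
```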

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [article]

Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, Guoqi Li
2022 arXiv   pre-print
In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on the pre-trained vision-language model.  ...  The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model.  ... 
arXiv:2203.14940v1 fatcat:ptcpum4ckrcuvjla5mrpzajei4
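
The DetPro snippet notes that class text embeddings are obtained by feeding prompts into the text encoder of a pre-trained vision-language model. A hedged sketch of that single step, using OpenAI's open-source `clip` package; the template strings and class names are placeholders, and DetPro itself learns the prompt context rather than hand-writing templates.

```python
import torch
import clip

# load a frozen CLIP model (downloads weights on first use)
model, _ = clip.load("ViT-B/32", device="cpu")
classnames = ["person", "dog", "traffic light"]
templates = ["a photo of a {}.", "a photo of a small {}."]

with torch.no_grad():
    class_embeddings = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates])
        feats = model.encode_text(tokens)                    # (n_templates, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        class_embeddings.append(feats.mean(dim=0))           # ensemble over templates
    class_embeddings = torch.stack(class_embeddings)          # (n_classes, dim)
```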

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [article]

Renrui Zhang, Longtian Qiu, Wei Zhang, Ziyao Zeng
2021 arXiv   pre-print
In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts.  ...  Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.  ...  Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021. [25] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, and F. Wei.  ... 
arXiv:2112.02399v1 fatcat:pk7gjz5ewnfdvayab7ljiwpfli

Prompt-based Learning for Unpaired Image Captioning [article]

Peipei Zhu, Xiao Wang, Lin Zhu, Zhenglong Sun, Weishi Zheng, Yaowei Wang, Changwen Chen
2022 arXiv   pre-print
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning from VL-PTMs.  ...  We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability and abundant vision-language prior knowledge learned under VL-PTMs  ...  Prompt Generation is used to draw prior knowledge from pre-trained vision-language models, i.e., the prompts for the UIC task.  ... 
arXiv:2205.13125v1 fatcat:n6cw6ff4c5dtnde6t3gclrpe34

Unsupervised Prompt Learning for Vision-Language Models [article]

Tony Huang, Jack Chu, Fangyun Wei
2022 arXiv   pre-print
To avoid laborious prompt engineering and simultaneously improve transfer performance, recent works such as CoOp, CLIP-Adapter and Tip-Adapter propose to adapt vision-language models for downstream image  ...  Contrastive vision-language models like CLIP have shown great progress in zero-shot transfer learning.  ...  As far as we know, UPL is the first work to introduce unsupervised learning into prompt learning of vision-language models.  ... 
arXiv:2204.03649v1 fatcat:sdc5uvlcjjhkrats2m5kfexo5i

Exploring Visual Prompts for Adapting Large-Scale Models [article]

Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, Phillip Isola
2022 arXiv   pre-print
We investigate the efficacy of visual prompting to adapt large-scale models in vision.  ...  Following the recent approach from prompt tuning and adversarial reprogramming, we learn a single image perturbation such that a frozen model prompted with this perturbation performs a new task.  ... 
arXiv:2203.17274v2 fatcat:tkmhhqgt6vdz7igktf6sr2tpq4
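
The visual-prompting snippet describes learning a single pixel-space perturbation so that a frozen model performs a new task. A minimal sketch under the assumption of a padding-style prompt; the border width, image size, and the stand-in frozen classifier are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class PadPrompter(nn.Module):
    """Learnable border of pixels added to every input image."""
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        # keep only a border of learnable pixels; the interior stays untouched
        mask = torch.zeros(1, 1, image_size, image_size)
        mask[..., :pad, :] = 1
        mask[..., -pad:, :] = 1
        mask[..., :, :pad] = 1
        mask[..., :, -pad:] = 1
        self.register_buffer("mask", mask)

    def forward(self, images):
        # the frozen model never changes; only this additive pattern is learned
        return images + self.prompt * self.mask

prompter = PadPrompter()
frozen_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).requires_grad_(False)
optimizer = torch.optim.Adam(prompter.parameters(), lr=1e-3)   # trains the prompt only
logits = frozen_model(prompter(torch.randn(2, 3, 224, 224)))
```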

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models [article]

Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren
2022 arXiv   pre-print
Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning.  ...  For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).  ... 
arXiv:2110.08484v2 fatcat:dsfqfdvlhbenpfkgjgqlyulaba
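
The FewVLM snippet names two text objectives, prefix language modeling and masked language modeling. A toy sketch of how training pairs for each could be formed from a caption; the split point, mask rate, and `<mask>` sentinel are assumptions, and the paper's exact recipe and its image conditioning are not reproduced here.

```python
import random

def prefix_lm_example(tokens):
    """PrefixLM: the model conditions on a prefix and must generate the remainder."""
    cut = random.randint(1, len(tokens) - 1)
    return tokens[:cut], tokens[cut:]

def masked_lm_example(tokens, p=0.15):
    """MaskedLM: randomly masked tokens become the prediction targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < p:
            inputs.append("<mask>")
            targets.append(tok)
        else:
            inputs.append(tok)
    return inputs, targets

caption = "a brown dog runs across the wet grass".split()
print(prefix_lm_example(caption))    # e.g. (['a', 'brown', 'dog'], ['runs', 'across', ...])
print(masked_lm_example(caption))
```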

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [article]

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
2022 arXiv   pre-print
By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge.  ...  Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision  ...  Several learning-based prompting methods [13, 51, 56, 60] are proposed to modify the output of the language model to better adapt to the new tasks.  ... 
arXiv:2112.01518v2 fatcat:vpy4fs655jdcnmf332sur62azm

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting [article]

Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Rama Chellappa, Shih-Fu Chang
2022 arXiv   pre-print
We first show that meta-learning and prompt-based learning, the most commonly-used methods for few-shot learning and zero-shot transferring from pre-trained vision-language models to downstream tasks,  ...  Specifically, to better exploit the pre-trained vision-language models, the meta-learning based cross-modal prompting is proposed to generate soft prompts and further used to extract the semantic prototype  ... 
arXiv:2204.07841v1 fatcat:uxrlum2clzhhjbh67iugwcvwnu

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [article]

Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang
2021 arXiv   pre-print
To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL).  ...  Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining.  ...  Existing vision-language methods employ BERT-like objectives, such as masked language modeling and image-text matching [6, 23, 26, 41] to learn multimodal representations.  ... 
arXiv:2112.05587v2 fatcat:zgm6jrqz3rctrktm3z6ncw6jj4
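
The UniVL snippet mentions BERT-like objectives such as masked language modeling and image-text matching. Below is a hedged sketch of a generic image-text matching head trained with in-batch negatives; the random tensors stand in for real encoder features, and the concatenation-based head is an illustrative assumption rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, batch = 16, 4
itm_head = nn.Linear(2 * dim, 2)                 # binary: matched vs. mismatched pair

image_feats = torch.randn(batch, dim)            # stand-in image encoder outputs
text_feats = torch.randn(batch, dim)             # stand-in text encoder outputs

pos_pairs = torch.cat([image_feats, text_feats], dim=-1)
neg_pairs = torch.cat([image_feats, text_feats.roll(1, dims=0)], dim=-1)  # shuffled texts

logits = itm_head(torch.cat([pos_pairs, neg_pairs], dim=0))
labels = torch.cat([torch.ones(batch), torch.zeros(batch)]).long()
loss = F.cross_entropy(logits, labels)           # image-text matching objective
```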

CLIP-Adapter: Better Vision-Language Models with Feature Adapters [article]

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao
2021 arXiv   pre-print
In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning.  ...  While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.  ...  In this paper, we demonstrate that prompt tuning is not the only path to better vision-language models.  ... 
arXiv:2110.04544v1 fatcat:4zzqfhgcqncpjkwamgyab3s4cm
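
The CLIP-Adapter snippet proposes fine-tuning lightweight feature adapters on the visual or language branch instead of tuning prompts. A minimal sketch of a residual bottleneck adapter in that spirit; the reduction factor and residual ratio are illustrative defaults, not the paper's reported values.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Small bottleneck MLP that refines a frozen CLIP feature via a residual blend."""
    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        # blend adapted and original features; only the adapter's weights are trained
        return self.ratio * self.mlp(feat) + (1 - self.ratio) * feat

adapted = FeatureAdapter()(torch.randn(2, 512))
```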

Image Captioning In the Transformer Age [article]

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
2022 arXiv   pre-print
This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is the perfect candidate, having proven its huge potential in both vision  ...  The success of these large-scale models seems to weaken the importance of the single IC task.  ...  prompt-based learning unifies various NLP tasks as a single one: language modeling.  ... 
arXiv:2204.07374v1 fatcat:ftsoam2ei5da5fkygq4pztzxda

ActionCLIP: A New Paradigm for Video Action Recognition [article]

Mengmeng Wang, Jiazheng Xing, Yong Liu
2021 arXiv   pre-print
Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables  ...  Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train  ... 
arXiv:2109.08472v1 fatcat:dwtb4xtf6bcbfmbjq5eif5hzgi
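
The ActionCLIP snippet frames action recognition as video-text matching. A toy sketch of that inference step, with random tensors standing in for the outputs of the visual and text encoders, and mean pooling over frames as an assumed aggregation choice.

```python
import torch
import torch.nn.functional as F

frames = torch.randn(8, 512)                       # per-frame features from a visual encoder
video_emb = F.normalize(frames.mean(dim=0), dim=-1)

label_texts = ["a video of brushing hair", "a video of archery", "a video of diving"]
text_embs = F.normalize(torch.randn(len(label_texts), 512), dim=-1)  # from a text encoder

scores = text_embs @ video_emb                     # cosine similarity to each action label
print(label_texts[scores.argmax().item()])
```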

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [article]

Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, Furu Wei
2022 arXiv   pre-print
However, after being pre-trained by language supervision from a large number of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks.  ...  In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.  ...  Figure 1: Examples of the two vision-language understanding tasks. For VQA, language prompts are used.  ... 
arXiv:2203.07190v1 fatcat:whf2ljh2mjfa5l4wsbr5dpvktq

Inferring Offensiveness In Images From Natural Language Supervision [article]

Patrick Schramowski, Kristian Kersting
2021 arXiv   pre-print
Probing or fine-tuning (large-scale) pre-trained models results in state-of-the-art performance for many NLP tasks and, more recently, even for computer vision tasks when combined with image data.  ...  Based on human-annotated examples and the implicit knowledge of a CLIP based model, we demonstrate that one can select relevant prompts for rating the offensiveness of an image.  ...  To apply transfer learning, the most popular deep learning frameworks provide downloadable pre-trained models for ImageNet1k.  ... 
arXiv:2110.04222v1 fatcat:dyc7qhyfc5e2nkllgezbg7ullq
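
The last snippet describes selecting prompts to rate the offensiveness of an image with a CLIP-based model. A hedged sketch of that zero-shot scoring pattern using OpenAI's open-source `clip` package; the two prompt strings and the blank placeholder image are assumptions, not the authors' curated prompt set.

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
prompts = ["an offensive image", "an inoffensive, harmless image"]

image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)   # placeholder image
with torch.no_grad():
    logits_per_image, _ = model(image, clip.tokenize(prompts))
    probs = logits_per_image.softmax(dim=-1)                     # score per prompt
print(dict(zip(prompts, probs.squeeze(0).tolist())))
```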
Showing results 1 — 15 out of 87,532 results