
Conditional Prompt Learning for Vision-Language Models [article]

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
2022 arXiv   pre-print
A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. ... With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. ...
arXiv:2203.05557v1 fatcat:tfflh77tavdbhkytwd4usznx2a
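
The entry above describes prompt learning for vision-language models: instead of hand-writing a template such as "a photo of a {class}", CoOp learns the context tokens directly in the text encoder's embedding space, keeping both encoders frozen. Below is a minimal, hedged sketch of that idea; the module and dimensions are illustrative stand-ins, not the authors' code, and a plain linear layer replaces CLIP's text encoder.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """CoOp-style prompt: a few learnable context vectors are prepended to
    each (frozen) class-name embedding before the text encoder runs."""
    def __init__(self, n_ctx=4, dim=512, n_classes=10):
        super().__init__()
        # learnable context vectors, shared across all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # frozen class-name embeddings (stand-in for CLIP token embeddings)
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        n_classes = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # [n_classes, n_ctx + 1, dim]: context tokens followed by the class token
        return torch.cat([ctx, self.cls_emb], dim=1)

# Dummy frozen "text encoder": a linear projection of the mean-pooled prompt.
text_encoder = nn.Linear(512, 512)
for p in text_encoder.parameters():
    p.requires_grad = False

prompts = LearnableContext()
class_feats = text_encoder(prompts().mean(dim=1))   # [10, 512] class embeddings
image_feats = torch.randn(8, 512)                   # from a frozen image encoder
logits = image_feats @ class_feats.t()              # similarity-based classification
print(logits.shape)  # torch.Size([8, 10])
```

Only `self.ctx` receives gradients; the encoders stay frozen, which is what makes the adaptation cheap. CoCoOp, the paper listed here, additionally conditions these context vectors on the image feature.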

Prefix Conditioning Unifies Language and Label Supervision [article]

Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
2022 arXiv   pre-print
Vision-language contrastive learning suggests a new learning paradigm by leveraging a large amount of image-caption-pair data. ... However, a naive unification of the real caption and the prompt sentences could lead to a complication in learning, as the distribution shift in text may not be handled properly in the language encoder ...
arXiv:2206.01125v1 fatcat:cbrtt4iwwrdbzhn3os6gavizhu
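
One way to picture the prefix-conditioning idea is a learned "source" embedding, one for real captions and one for templated prompt sentences, prepended to the token sequence so a shared language encoder can separate the two text distributions. The sketch below is a hedged illustration with invented dimensions and module names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PrefixConditionedTextEncoder(nn.Module):
    """Prepends a learned 'source' prefix token (caption vs. classification
    prompt) to the token embeddings before a shared text encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.prefix = nn.Embedding(2, dim)   # 0 = caption data, 1 = prompt/label data
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, token_emb, source_id):
        # token_emb: [batch, seq, dim]; source_id: [batch] with values in {0, 1}
        prefix = self.prefix(source_id).unsqueeze(1)   # [batch, 1, dim]
        x = torch.cat([prefix, token_emb], dim=1)      # prefix-conditioned sequence
        return self.encoder(x)[:, 0]                   # pooled text feature

enc = PrefixConditionedTextEncoder()
caption_feat = enc(torch.randn(4, 16, 256), torch.zeros(4, dtype=torch.long))
prompt_feat  = enc(torch.randn(4, 8, 256),  torch.ones(4, dtype=torch.long))
print(caption_feat.shape, prompt_feat.shape)  # torch.Size([4, 256]) twice
```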

Exploring Visual Prompts for Adapting Large-Scale Models [article]

Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, Phillip Isola
2022 arXiv   pre-print
We investigate the efficacy of visual prompting to adapt large-scale models in vision. ... Following the recent approach from prompt tuning and adversarial reprogramming, we learn a single image perturbation such that a frozen model prompted with this perturbation performs a new task. ...
arXiv:2203.17274v2 fatcat:tkmhhqgt6vdz7igktf6sr2tpq4
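
The mechanism described above is compact enough to sketch directly: a single learnable perturbation is added to every input image, the pretrained model stays frozen, and only the perturbation is optimized for the downstream task. A hedged toy version follows; the frozen model here is a placeholder linear classifier, not CLIP, and the image size is invented.

```python
import torch
import torch.nn as nn

# Frozen pretrained classifier (placeholder for a large vision model).
frozen_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
for p in frozen_model.parameters():
    p.requires_grad = False

# Single input-space prompt: one perturbation shared by all images.
visual_prompt = nn.Parameter(torch.zeros(1, 3, 32, 32))
optimizer = torch.optim.Adam([visual_prompt], lr=0.1)

images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 10, (16,))

for step in range(5):
    logits = frozen_model(images + visual_prompt)   # prompt the frozen model
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                                  # gradients flow only into the prompt
    optimizer.step()
print(loss.item())
```

In the paper the perturbation is typically restricted to a padding region around the image; the additive form above is just the simplest variant.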

Language-biased image classification: evaluation based on semantic representations [article]

Yoann Lemesle, Masataka Sawayama, Guillermo Valle-Perez, Maxime Adolphe, Hélène Sauzéon, Pierre-Yves Oudeyer
2022 arXiv   pre-print
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference. ...
arXiv:2201.11014v2 fatcat:cuvpzavlfbaf3hujnpytsy6k6e

Image Captioning In the Transformer Age [article]

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
2022 arXiv   pre-print
This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is a natural choice, having proven its potential in both the vision and language domains, and thus can be used as the basic component of the visual encoder and language decoder in an IC pipeline. ... Prompt-based learning unifies various NLP tasks as a single one: language modeling. ...
arXiv:2204.07374v1 fatcat:ftsoam2ei5da5fkygq4pztzxda
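
To make the "homogeneous architecture" concrete: both the visual encoder and the language decoder are Transformers, so an image-captioning model reduces to a standard sequence-to-sequence stack over patch features and caption tokens. A minimal, hedged sketch using PyTorch's generic nn.Transformer follows; the class name, vocabulary size, and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Transformer encoder over image patch features, Transformer decoder over caption tokens."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)     # e.g. ViT patch features -> model dim
        self.tok_emb = nn.Embedding(vocab, dim)
        self.transformer = nn.Transformer(d_model=dim, nhead=4,
                                           num_encoder_layers=2, num_decoder_layers=2,
                                           batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patches, caption_ids):
        src = self.patch_proj(patches)            # [B, n_patches, dim]
        tgt = self.tok_emb(caption_ids)           # [B, seq, dim]
        causal = self.transformer.generate_square_subsequent_mask(caption_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)                  # next-token logits for teacher forcing

model = TinyCaptioner()
logits = model(torch.randn(2, 49, 768), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```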

CLIP-Adapter: Better Vision-Language Models with Feature Adapters [article]

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao
2021 arXiv   pre-print
In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. ... Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. ...
arXiv:2110.04544v1 fatcat:4zzqfhgcqncpjkwamgyab3s4cm
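
The adapter alternative mentioned above is also easy to sketch: a small bottleneck MLP sits on top of the frozen encoder's output feature and is blended back residually, and only the adapter is trained. The names, reduction ratio, and blend weight below are illustrative, not the released CLIP-Adapter code.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Bottleneck MLP on top of a frozen feature, blended residually."""
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, feat):
        adapted = self.net(feat)
        # residual blend: mostly the original (frozen) feature, plus a learned correction
        return self.alpha * adapted + (1 - self.alpha) * feat

adapter = FeatureAdapter()
frozen_image_feat = torch.randn(8, 512)   # from a frozen CLIP-like image encoder
class_text_feats = torch.randn(10, 512)   # from the frozen text encoder
logits = adapter(frozen_image_feat) @ class_text_feats.t()
print(logits.shape)  # torch.Size([8, 10])
```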

Flamingo: a Visual Language Model for Few-Shot Learning [article]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford (+15 others)
2022 arXiv   pre-print
For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples ... Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, ...
arXiv:2204.14198v1 fatcat:5f4uhdmaibhm7cn3zetspjev3q
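
One of the architectural ideas mentioned above, bridging frozen vision and language models, is commonly realized with cross-attention layers whose output is passed through a tanh gate initialized at zero, so the frozen language model's behaviour is unchanged at the start of training. The layer below is a simplified, hedged sketch of that pattern, not the Flamingo code, which also uses a Perceiver-style resampler and interleaved-image masking.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention from text states to visual tokens, added through a
    tanh gate initialized at zero so the frozen LM starts out unmodified."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> no effect at init

    def forward(self, text_states, visual_tokens):
        attended, _ = self.attn(text_states, visual_tokens, visual_tokens)
        return text_states + torch.tanh(self.gate) * attended

layer = GatedCrossAttention()
text_states = torch.randn(2, 20, 256)     # hidden states from a frozen language model
visual_tokens = torch.randn(2, 64, 256)   # features from a frozen image encoder
print(layer(text_states, visual_tokens).shape)  # torch.Size([2, 20, 256])
```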

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting [article]

Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Rama Chellappa, Shih-Fu Chang
2022 arXiv   pre-print
We first show that meta-learning and prompt-based learning, the most commonly used methods for few-shot learning and zero-shot transferring from pre-trained vision-language models to downstream tasks, ... Specifically, to better exploit the pre-trained vision-language models, meta-learning-based cross-modal prompting is proposed to generate soft prompts, which are further used to extract the semantic prototype ...
arXiv:2204.07841v1 fatcat:uxrlum2clzhhjbh67iugwcvwnu
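
The cross-modal prompting step can be pictured as a small network that maps support-image features to soft prompt vectors, which are then run through the text encoder to produce a class prototype for matching. The sketch below only illustrates that flow under invented shapes and names; the paper's full few-shot detection pipeline is considerably more involved.

```python
import torch
import torch.nn as nn

class CrossModalPromptGenerator(nn.Module):
    """Maps pooled support-image features to a few soft prompt tokens."""
    def __init__(self, vis_dim=512, txt_dim=256, n_prompt=4):
        super().__init__()
        self.n_prompt, self.txt_dim = n_prompt, txt_dim
        self.to_prompt = nn.Linear(vis_dim, n_prompt * txt_dim)

    def forward(self, support_feats):
        # support_feats: [n_shot, vis_dim] for one class -> average, then generate prompts
        pooled = support_feats.mean(dim=0)
        return self.to_prompt(pooled).view(self.n_prompt, self.txt_dim)

# Dummy frozen text encoder: pools the soft prompts into one prototype vector.
text_encoder = nn.Linear(256, 256)

gen = CrossModalPromptGenerator()
support_feats = torch.randn(5, 512)                 # 5-shot support images, pre-encoded
soft_prompts = gen(support_feats)                   # [4, 256] generated soft prompts
prototype = text_encoder(soft_prompts.mean(dim=0))  # class prototype used for matching
print(prototype.shape)  # torch.Size([256])
```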

ActionCLIP: A New Paradigm for Video Action Recognition [article]

Mengmeng Wang, Jiazheng Xing, Yong Liu
2021 arXiv   pre-print
Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables ... the "pre-train, prompt and fine-tune" paradigm. ...
arXiv:2109.08472v1 fatcat:dwtb4xtf6bcbfmbjq5eif5hzgi
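
The video-text matching formulation above reduces to a small recipe: encode each frame with an image encoder, pool over time to get a video feature, and score it against text features of prompted label sentences such as "a video of a person {action}". Everything below is a hedged toy version with dummy encoders and invented sizes, not the ActionCLIP code.

```python
import torch
import torch.nn as nn

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))  # per-frame features
text_encoder = nn.Linear(300, 256)                                        # prompted label sentences

def video_feature(frames):
    # frames: [T, 3, 32, 32] -> encode each frame, then mean-pool over time
    return image_encoder(frames).mean(dim=0)

action_prompts = torch.randn(400, 300)   # stand-in embeddings of "a video of a person {action}"
video = torch.randn(8, 3, 32, 32)        # 8 sampled frames

v = nn.functional.normalize(video_feature(video), dim=-1)
t = nn.functional.normalize(text_encoder(action_prompts), dim=-1)
scores = v @ t.t()                       # similarity to every prompted action label
print(scores.argmax().item())            # index of the predicted action
```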

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [article]

Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang
2021 arXiv   pre-print
To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL). ... Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. ...
arXiv:2112.05587v2 fatcat:zgm6jrqz3rctrktm3z6ncw6jj4
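
To ground the "BERT-like objectives" phrase: image-text matching is simply a binary classifier over a fused image-text representation, trained with aligned pairs as positives and shuffled pairs as negatives. The snippet below is a hedged minimal sketch of that objective on pre-pooled dummy features; masked language modeling and the paper's unified generation objective are omitted.

```python
import torch
import torch.nn as nn

# Image-text matching (ITM): classify whether an image feature and a text
# feature come from the same pair. Dummy pooled features stand in for encoder outputs.
itm_head = nn.Sequential(nn.Linear(512 + 512, 256), nn.ReLU(), nn.Linear(256, 2))

image_feats = torch.randn(8, 512)
text_feats = torch.randn(8, 512)

# positives: aligned pairs; negatives: texts rolled by one position in the batch
pos = torch.cat([image_feats, text_feats], dim=-1)
neg = torch.cat([image_feats, text_feats.roll(1, dims=0)], dim=-1)
logits = itm_head(torch.cat([pos, neg], dim=0))
labels = torch.cat([torch.ones(8, dtype=torch.long), torch.zeros(8, dtype=torch.long)])
loss = nn.functional.cross_entropy(logits, labels)
print(loss.item())
```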

Language Models are General-Purpose Interfaces [article]

Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei
2022 arXiv   pre-print
Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot ... A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. ...
arXiv:2206.06336v1 fatcat:m63fbkoctzhbnfl3vldtb42ikq

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [article]

Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, Furu Wei
2022 arXiv   pre-print
However, after being pre-trained by language supervision from a large number of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. ... In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. ...
arXiv:2203.07190v1 fatcat:whf2ljh2mjfa5l4wsbr5dpvktq
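
The "power of language" here amounts to prompt scoring: each candidate answer is written into a template together with the question content, and CLIP-style image-text similarity picks the answer whose sentence best matches the image. A hedged sketch with stand-in encoders follows; the real work uses CLIP's tokenizer and encoders and more careful templates, and the dummy featurization below is purely illustrative.

```python
import torch
import torch.nn as nn

# Stand-ins for frozen CLIP encoders (in practice: clip.encode_image / encode_text).
image_encoder = nn.Linear(1024, 512)
text_encoder = nn.Linear(300, 512)

question = "what color is the bus"
answers = ["red", "blue", "yellow"]
# Rewrite each (question, answer) pair as a declarative prompt to be scored.
prompts = [f"a photo of a bus that is {a}" for a in answers]

# Dummy featurization: one vector per prompt (a real pipeline would tokenize `prompts`).
prompt_feats = nn.functional.normalize(text_encoder(torch.randn(len(prompts), 300)), dim=-1)
image_feat = nn.functional.normalize(image_encoder(torch.randn(1, 1024)), dim=-1)

best = (image_feat @ prompt_feats.t()).argmax().item()
print("predicted answer:", answers[best])
```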

A Unified Sequence Interface for Vision Tasks [article]

Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton
2022 arXiv   pre-print
While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. ... As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. ...
arXiv:2206.07669v1 fatcat:erfjwf5yivbinahb4ypg2nl2oq
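
The unified-sequence idea is easiest to see for detection: continuous box coordinates are quantized into a discrete vocabulary, so a detection target becomes a short token sequence that an autoregressive decoder can generate like text. Below is a minimal, hedged sketch of that tokenization step; the bin count and token layout are invented for illustration, not taken from the paper.

```python
import torch

N_BINS = 1000  # coordinate vocabulary size; class ids live above this range

def box_to_tokens(box, label, image_size):
    """Quantize [xmin, ymin, xmax, ymax] into 4 coordinate tokens + 1 class token."""
    norm = torch.tensor(box) / torch.tensor([image_size] * 4, dtype=torch.float)
    coord_tokens = (norm * (N_BINS - 1)).round().long()
    return torch.cat([coord_tokens, torch.tensor([N_BINS + label])])

def tokens_to_box(tokens, image_size):
    """Inverse mapping: tokens back to (box coordinates, class id)."""
    coords = tokens[:4].float() / (N_BINS - 1) * image_size
    return coords.tolist(), int(tokens[4]) - N_BINS

seq = box_to_tokens([48, 120, 300, 420], label=7, image_size=640)
print(seq)                      # 4 coordinate tokens followed by one class token
print(tokens_to_box(seq, 640))  # recovers the box up to quantization error
```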

Multimodal Few-Shot Learning with Frozen Language Models [article]

Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
2021 arXiv   pre-print
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. ... Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). ...
arXiv:2106.13884v2 fatcat:t5jhbtqgc5dn7gbzr2cxnpqjdq
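
The approach in this entry maps image features into the language model's token-embedding space and feeds them as a prefix, training only the vision side while the language model stays frozen. A hedged skeleton of that wiring follows; the dummy embedding and Transformer encoder stand in for a real pretrained language model, and all names and sizes are invented.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Projects an image feature into a few pseudo 'token' embeddings that a
    frozen language model consumes as a prefix before the text tokens."""
    def __init__(self, vis_dim=512, lm_dim=256, n_prefix=2):
        super().__init__()
        self.n_prefix, self.lm_dim = n_prefix, lm_dim
        self.proj = nn.Linear(vis_dim, n_prefix * lm_dim)

    def forward(self, image_feat):
        return self.proj(image_feat).view(-1, self.n_prefix, self.lm_dim)

# Frozen "language model": token embedding + a Transformer encoder as a stand-in.
tok_emb = nn.Embedding(1000, 256)
lm_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
lm = nn.TransformerEncoder(lm_layer, num_layers=2)
for p in list(tok_emb.parameters()) + list(lm.parameters()):
    p.requires_grad = False   # only the visual prefix is trained

prefix = VisualPrefix()(torch.randn(4, 512))       # [4, 2, 256] visual prefix tokens
text = tok_emb(torch.randint(0, 1000, (4, 10)))    # [4, 10, 256] prompt token embeddings
hidden = lm(torch.cat([prefix, text], dim=1))      # LM sees image tokens, then text
print(hidden.shape)  # torch.Size([4, 12, 256])
```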

DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [article]

Ximeng Sun, Ping Hu, Kate Saenko
2022 arXiv   pre-print
Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks that have limited annotations and even ... Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. ... Vision-language models [22, 46] based on contrastive learning have demonstrated an impressive ability to learn generic visual representations. ...
arXiv:2206.09541v1 fatcat:jelxeuzavzdk5lwpii6xhn2jeq
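
The "light learnable overhead" is a pair of learnable prompts per label, one positive and one negative: an image is compared against both, and a softmax over the two similarities gives the probability that the label is present, so multi-label prediction needs no extra classifier head. A hedged toy sketch of that scoring rule follows; the prompt embeddings are raw parameters here rather than the paper's encoded prompt pairs.

```python
import torch
import torch.nn as nn

n_labels, dim = 20, 512

# Two learnable text-side embeddings per label (stand-ins for encoded
# positive/negative prompt pairs, e.g. presence vs. absence of the label).
pos_prompts = nn.Parameter(torch.randn(n_labels, dim))
neg_prompts = nn.Parameter(torch.randn(n_labels, dim))

image_feats = torch.randn(8, dim)   # from a frozen vision-language image encoder

pos_sim = image_feats @ pos_prompts.t()   # [8, n_labels]
neg_sim = image_feats @ neg_prompts.t()
# per-label probability of presence: softmax over the (positive, negative) pair
prob_present = torch.softmax(torch.stack([pos_sim, neg_sim], dim=-1), dim=-1)[..., 0]
predictions = prob_present > 0.5          # independent multi-label decision per class
print(prob_present.shape, predictions.sum().item())
```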
Showing results 1 — 15 out of 75,173 results