Conditional Prompt Learning for Vision-Language Models
[article]
2022
arXiv
pre-print
A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. ...
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. ...
[62] have recently explored the concept of prompt learning, a recent trend in NLP [15, 25, 30, 32, 44, 60], for adapting pre-trained vision-language models. ...
arXiv:2203.05557v1
fatcat:tfflh77tavdbhkytwd4usznx2a
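The CoOp/CoCoOp entry above hinges on replacing hand-written prompt words with learnable context vectors passed through a frozen text encoder (CoCoOp additionally conditions these vectors on the image). Below is a minimal sketch of that idea; the embedding dimension, context length, and the frozen-encoder interface are placeholder assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LearnablePromptContext(nn.Module):
    """CoOp-style prompt learning sketch: a shared set of context vectors is
    learned and prepended to each class-name embedding; only these vectors
    receive gradients, while the vision-language model itself stays frozen."""

    def __init__(self, class_name_embeds, n_ctx=16, embed_dim=512):
        super().__init__()
        # Learnable context tokens, initialized with small noise.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Frozen class-name token embeddings: (n_classes, n_name_tokens, embed_dim).
        self.register_buffer("name_embeds", class_name_embeds)

    def forward(self):
        n_classes = self.name_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_ctx + n_name_tokens, embed_dim)
        return torch.cat([ctx, self.name_embeds], dim=1)

# Usage sketch (interfaces assumed): text_feats = frozen_text_encoder(prompt_learner())
# logits = image_feats @ text_feats.t() / temperature
# Cross-entropy on logits updates only prompt_learner.ctx.
```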
Prefix Conditioning Unifies Language and Label Supervision
[article]
2022
arXiv
pre-print
Vision-language contrastive learning suggests a new learning paradigm by leveraging a large amount of image-caption-pair data. ...
However, a naive unification of the real caption and the prompt sentences could lead to a complication in learning, as the distribution shift in text may not be handled properly in the language encoder ...
Related Work Vision-Language Contrastive Learning. ...
arXiv:2206.01125v1
fatcat:cbrtt4iwwrdbzhn3os6gavizhu
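The complication noted above, that real captions and prompt-style label sentences follow different text distributions, suggests conditioning the text encoder on the data source. The sketch below shows one way to do that with learned prefix embeddings; the dimensions and the single-token prefix are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SourcePrefix(nn.Module):
    """Sketch: one learnable prefix embedding per data source (e.g. 0 = real
    caption, 1 = classification prompt) is prepended to the text token
    embeddings, letting the encoder keep the two distributions apart."""

    def __init__(self, embed_dim=512, n_sources=2, prefix_len=1):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_sources, prefix_len, embed_dim) * 0.02)

    def forward(self, token_embeds, source_id):
        # token_embeds: (batch, seq_len, embed_dim); source_id: int
        batch = token_embeds.shape[0]
        prefix = self.prefix[source_id].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)
```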
Exploring Visual Prompts for Adapting Large-Scale Models
[article]
2022
arXiv
pre-print
We investigate the efficacy of visual prompting to adapt large-scale models in vision. ...
Following the recent approach from prompt tuning and adversarial reprogramming, we learn a single image perturbation such that a frozen model prompted with this perturbation performs a new task. ...
We thank Judy Hoffman for helpful discussion and advice. This work was partially supported by funding from MIT STL and an MIT RSC award from the NEC fund. ...
arXiv:2203.17274v2
fatcat:tkmhhqgt6vdz7igktf6sr2tpq4
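The second snippet above is the core mechanism: a single learnable input-space perturbation steers a frozen model toward a new task. A minimal sketch follows, assuming a generic frozen classifier and a full-frame additive prompt (the paper mainly studies padding-shaped prompts); hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPrompt(nn.Module):
    """One shared perturbation added to every input image; only this tensor is trained."""

    def __init__(self, image_size=224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, images):
        # Keep prompted pixels in a valid range (assumes inputs in [0, 1]).
        return torch.clamp(images + self.delta, 0.0, 1.0)

def prompt_step(frozen_model, prompt, images, labels, optimizer):
    frozen_model.eval()                      # model weights are never updated
    logits = frozen_model(prompt(images))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(prompt.parameters(), lr=0.1)  # learning rate is task-dependent
```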
Language-biased image classification: evaluation based on semantic representations
[article]
2022
arXiv
pre-print
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference. ...
learning of language and vision. ...
arXiv:2201.11014v2
fatcat:cuvpzavlfbaf3hujnpytsy6k6e
Image Captioning In the Transformer Age
[article]
2022
arXiv
pre-print
This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is a natural choice: it has proven its potential in both the vision and language domains and can thus be used as the basic component of the visual encoder and language decoder in an IC pipeline. ...
Prompt-based learning unifies various NLP tasks as a single one: language modeling. ...
arXiv:2204.07374v1
fatcat:ftsoam2ei5da5fkygq4pztzxda
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
[article]
2021
arXiv
pre-print
In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. ...
Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. ...
In this paper, we propose a different approach for better adapting vision-language models with feature adapters instead of prompt tuning. ...
arXiv:2110.04544v1
fatcat:4zzqfhgcqncpjkwamgyab3s4cm
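The alternative path named in these snippets is a lightweight feature adapter on top of frozen CLIP features rather than learned prompts. Below is a minimal sketch of a bottleneck adapter with residual blending; the reduction factor and blend ratio are placeholders, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Small bottleneck MLP applied to frozen image (or text) features,
    mixed back with the original feature so pretrained knowledge is kept."""

    def __init__(self, feat_dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // reduction, feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        adapted = self.mlp(feat)
        # Residual blending: keep most of the frozen feature, add a learned correction.
        return self.ratio * adapted + (1.0 - self.ratio) * feat
```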
Flamingo: a Visual Language Model for Few-Shot Learning
[article]
2022
arXiv
pre-print
For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples ...
Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, ...
Acknowledgements We would like to thank many of our colleagues for useful discussions, suggestions, feedback, and advice, including: Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente ...
arXiv:2204.14198v1
fatcat:5f4uhdmaibhm7cn3zetspjev3q
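The architectural points listed above include bridging frozen pretrained vision-only and language-only models. One ingredient described for that bridge is gated cross-attention interleaved with the frozen language-model blocks; the sketch below is a simplified single-layer version, with the dimensions, head count, and the omitted gated feed-forward part as deliberate simplifications rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a tanh-gated cross-attention layer: text tokens attend to
    visual features, and a gate initialized at zero makes the layer start as
    an identity so the frozen language model is undisturbed at the beginning."""

    def __init__(self, dim=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: identity at init

    def forward(self, text_tokens, visual_feats):
        # text_tokens: (batch, seq, dim); visual_feats: (batch, n_visual, dim)
        attended, _ = self.attn(text_tokens, visual_feats, visual_feats)
        return text_tokens + torch.tanh(self.gate) * attended
```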
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
[article]
2022
arXiv
pre-print
We first show that meta-learning and prompt-based learning, the most commonly-used methods for few-shot learning and zero-shot transferring from pre-trained vision-language models to downstream tasks, ...
Specifically, to better exploit the pre-trained vision-language models, meta-learning based cross-modal prompting is proposed to generate soft prompts, which are further used to extract the semantic prototype ...
Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. ...
arXiv:2204.07841v1
fatcat:uxrlum2clzhhjbh67iugwcvwnu
ActionCLIP: A New Paradigm for Video Action Recognition
[article]
2021
arXiv
pre-print
Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables ...
The proposed paradigm is "pre-train, prompt and fine-tune". ...
Acknowledgements We would like to thank Zeyi Huang for his constructive suggestions and comments on this work. ...
arXiv:2109.08472v1
fatcat:dwtb4xtf6bcbfmbjq5eif5hzgi
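Casting action recognition as video-text matching, as the first snippet describes, reduces classification to comparing a pooled video feature against text features of label prompts. Below is a toy sketch of that scoring step; the mean pooling over frames and the temperature value are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def action_logits(frame_feats, label_text_feats, temperature=0.07):
    """frame_feats: (batch, n_frames, dim) from an image encoder;
    label_text_feats: (n_actions, dim) from prompts like 'a video of <action>'."""
    video_feat = frame_feats.mean(dim=1)                     # simple temporal pooling
    video_feat = F.normalize(video_feat, dim=-1)
    label_text_feats = F.normalize(label_text_feats, dim=-1)
    return video_feat @ label_text_feats.t() / temperature   # (batch, n_actions)
```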
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
[article]
2021
arXiv
pre-print
To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL). ...
Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. ...
Existing vision-language methods employ BERT-like objectives, such as masked language modeling and image-text matching [6, 23, 26, 41] to learn multimodal representations. ...
arXiv:2112.05587v2
fatcat:zgm6jrqz3rctrktm3z6ncw6jj4
Language Models are General-Purpose Interfaces
[article]
2022
arXiv
pre-print
Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot ...
A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. ...
for vision-language pretraining. ...
arXiv:2206.06336v1
fatcat:m63fbkoctzhbnfl3vldtb42ikq
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
[article]
2022
arXiv
pre-print
However, after being pre-trained by language supervision from a large amount of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. ...
In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. ...
Figure 1: Examples of the two vision-language understanding tasks. For VQA, language prompts are used. ...
arXiv:2203.07190v1
fatcat:whf2ljh2mjfa5l4wsbr5dpvktq
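Per the figure caption above, VQA is handled with language prompts: candidate answers are slotted into a prompt and scored against the image. The sketch below shows only that similarity-scoring step with a hypothetical template and an assumed CLIP-like encode_image/encode_text interface; the paper's full pipeline involves more than this and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def score_answers(model, tokenizer, image, question, candidate_answers):
    """Rank candidate answers by similarity between the image and a prompted
    question-answer sentence. Template and model interface are illustrative."""
    prompts = [f"question: {question} answer: {ans}" for ans in candidate_answers]
    with torch.no_grad():
        img_feat = F.normalize(model.encode_image(image), dim=-1)              # (1, dim)
        txt_feat = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)  # (n, dim)
    return (img_feat @ txt_feat.t()).squeeze(0)  # higher score = better answer
```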
A Unified Sequence Interface for Vision Tasks
[article]
2022
arXiv
pre-print
While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. ...
As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. ...
Acknowledgements We specially thank Wei Li for their helpful feedback on the initial draft. ...
arXiv:2206.07669v1
fatcat:erfjwf5yivbinahb4ypg2nl2oq
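The unified interface described above casts vision tasks as token-sequence generation. As a concrete illustration of how a non-linguistic target can become a token sequence, the sketch below quantizes one bounding box into discrete tokens; the token layout and bin count are illustrative assumptions, not the paper's exact scheme.

```python
def box_to_tokens(box, class_id, image_size, n_bins=1000):
    """Sketch: each box coordinate is quantized into one of n_bins shared
    coordinate tokens, followed by a class-label token offset past the bins."""
    ymin, xmin, ymax, xmax = box
    quantize = lambda v: int(round(v / image_size * (n_bins - 1)))
    return [quantize(ymin), quantize(xmin), quantize(ymax), quantize(xmax), n_bins + class_id]
```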
Multimodal Few-Shot Learning with Frozen Language Models
[article]
2021
arXiv
pre-print
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. ...
Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). ...
Acknowledgements We wish to thank Sebastian Borgeaud and Jack Rae for preparing the pretraining text dataset and pretraining a selection of transformer language models, as well as Trevor Cai for help with ...
arXiv:2106.13884v2
fatcat:t5jhbtqgc5dn7gbzr2cxnpqjdq
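The transfer described above keeps the language model frozen and trains a vision encoder whose output is injected as a short prefix of pseudo-tokens in the LM's embedding space. Below is a stripped-down sketch of that mapping; the dimensions and prefix length are placeholder values, not the paper's settings.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Projects a global image feature into a few embeddings that are
    prepended to the frozen language model's input embeddings."""

    def __init__(self, vision_dim=768, lm_dim=1024, prefix_len=2):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.proj = nn.Linear(vision_dim, prefix_len * lm_dim)

    def forward(self, image_feat):                 # (batch, vision_dim)
        prefix = self.proj(image_feat)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Usage sketch: inputs = torch.cat([visual_prefix(img_feat), lm_embed(caption_ids)], dim=1)
# The usual next-token prediction loss then trains only the vision side.
```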
DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations
[article]
2022
arXiv
pre-print
Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks that have limited annotations and even ...
Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. ...
Prompt Learning for Vision-Language Models. Vision-Language Models [22, 46] based on contrastive learning have demonstrated impressive ability to learn generic visual representations. ...
arXiv:2206.09541v1
fatcat:jelxeuzavzdk5lwpii6xhn2jeq
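The "very light learnable overhead" mentioned above is a pair of learnable prompt contexts per class, one positive and one negative, so each label's presence can be decided by comparing the image feature's similarity to the two prompts. The sketch below covers only that decision step (the prompt learning itself would follow the CoOp-style sketch earlier); shapes and temperature are assumptions.

```python
import torch

def multilabel_probs(image_feat, pos_text_feats, neg_text_feats, temperature=0.07):
    """image_feat: (batch, dim); pos/neg_text_feats: (n_classes, dim) from the
    learned positive/negative prompts. Returns per-class presence probabilities."""
    pos_logits = image_feat @ pos_text_feats.t() / temperature   # (batch, n_classes)
    neg_logits = image_feat @ neg_text_feats.t() / temperature
    # Softmax over the (positive, negative) pair for each class independently.
    return torch.softmax(torch.stack([pos_logits, neg_logits], dim=-1), dim=-1)[..., 0]
```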
Showing results 1 — 15 out of 75,173 results