
Multimodal Few-Shot Learning with Frozen Language Models [article]

Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
2021 arXiv   pre-print
Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language).  ...  We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring  ...  Acknowledgements We wish to thank Sebastian Borgeaud and Jack Rae for preparing the pretraining text dataset and pretraining a selection of transformer language models, as well as Trevor Cai for help with  ... 
arXiv:2106.13884v2 fatcat:t5jhbtqgc5dn7gbzr2cxnpqjdq

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models [article]

Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren
2022 arXiv   pre-print
Furthermore, we analyze the effect of diverse prompts for few-shot tasks.  ...  given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance.  ...  For FEWVLM, we use "question: [Q] answer <text_1>" (P3) as an input prompt and "<text_1> [A]" as a target prompt for visual question answering, and "an image of" (Q3) as an input prompt for captioning,  ... 
arXiv:2110.08484v2 fatcat:dsfqfdvlhbenpfkgjgqlyulaba
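The prompt templates quoted in this entry (P3 for visual question answering, Q3 for captioning) can be sketched as plain string builders. This is a minimal illustration of the template format only; the function names and the example question/answer are not from the paper's code:

```python
# Sketch of the FewVLM-style prompt templates quoted above.
# <text_1> is the sentinel the model is trained to fill in.

def vqa_prompts(question: str, answer: str) -> tuple[str, str]:
    """Build the (input, target) prompt pair for VQA (template P3)."""
    input_prompt = f"question: {question} answer <text_1>"
    target_prompt = f"<text_1> {answer}"
    return input_prompt, target_prompt

def caption_prompt() -> str:
    """Input prompt for image captioning (template Q3); the model continues it."""
    return "an image of"

inp, tgt = vqa_prompts("what color is the cat?", "black")
print(inp)  # question: what color is the cat? answer <text_1>
print(tgt)  # <text_1> black
```

The sentinel-based target mirrors T5-style masked-span training, which is consistent with the MaskedLM/PrefixLM comparison mentioned in the abstract.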

MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning [article]

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, Anette Frank
2021 arXiv   pre-print
We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning.  ...  Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input.  ...  We use the same visual encoder (CLIP 'RN50x16') for all adapter ablations and evaluate the open-ended few-shot scores on the Visual Question Answering and Image Captioning tasks described in 4.1.1 and  ... 
arXiv:2112.05253v1 fatcat:2pjd6mfwrvbsxkzfdxbflwiy54

Learning Compositional Representation for Few-shot Visual Question Answering [article]

Dalu Guo, Dacheng Tao
2021 arXiv   pre-print
Current methods of Visual Question Answering perform well on answers with ample training data but have limited accuracy on novel ones with few examples.  ...  We generate the few-shot dataset of VQA with a variety of answers and their attributes without any human effort.  ...  , e.g. image captioning [1], visual grounding [2], visual question answering (VQA) [3], [4], [5], and visual dialog [6], [7].  ... 
arXiv:2102.10575v1 fatcat:2ftyqx6sevew3ipuarjque6w44

Attributes as Semantic Units between Natural Language and Visual Recognition [article]

Marcus Rohrbach
2016 arXiv   pre-print
Specifically we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions about images and video, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images.  ...  Visual question answering is the problem of answering natural language questions about images, e.g. for the question "Where is the amber cat?"  ... 
arXiv:1604.03249v1 fatcat:a5dpwgoddvcsvkovik2gupbri4

Flamingo: a Visual Language Model for Few-Shot Learning [article]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford (+15 others)
2022 arXiv   pre-print
These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering.  ...  Acknowledgements We would like to thank many of our colleagues for useful discussions, suggestions, feedback, and advice, including: Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente  ... 
arXiv:2204.14198v1 fatcat:5f4uhdmaibhm7cn3zetspjev3q

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [article]

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
2022 arXiv   pre-print
Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.  ...  However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction  ...  Daruki, Nan Du and Aashi Jain for help with data preparation, Jonathan Shen, Colin Raffel and Sharan Narang for assistance on experimental settings, and others in the Google Brain team for support throughout  ... 
arXiv:2108.10904v3 fatcat:glozbeeytvdyvcgl7ersyz4i34

CoCa: Contrastive Captioners are Image-Text Foundation Models [article]

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
2022 arXiv   pre-print
Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics  ...  cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations.  ...  for help with visual illustration, Liangliang Cao for proofreading, and others in the Google Brain team for support throughout this project.  ... 
arXiv:2205.01917v1 fatcat:pxqshctdxnhx5ktqooigdttfrq

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [article]

Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
2021 arXiv   pre-print
To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter, which significantly improves performance on few-shot classification.  ...  Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by using large-scale contrastive image-text pairs.  ...  Answering [1, 2, 18, 33], Image Captioning [29, 67], and Referring Expression [68].  ... 
arXiv:2111.03930v2 fatcat:ntojz5cn65eghfqvbmcgij4s2i
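The lightweight residual feature adapter that this entry attributes to CLIP-Adapter can be sketched roughly as follows. This is a hedged sketch, not the paper's implementation: the bottleneck size, blend ratio, and random weights are assumptions for illustration.

```python
import numpy as np

def residual_adapter(feat, w1, w2, ratio=0.2):
    """Pass a frozen CLIP feature through a small bottleneck MLP,
    then blend the adapted feature back into the original (residual)."""
    hidden = np.maximum(feat @ w1, 0.0)            # down-projection + ReLU
    adapted = hidden @ w2                          # up-projection
    return ratio * adapted + (1.0 - ratio) * feat  # residual blend

rng = np.random.default_rng(0)
d, bottleneck = 8, 2                               # toy dimensions
feat = rng.standard_normal(d)
w1 = rng.standard_normal((d, bottleneck))
w2 = rng.standard_normal((bottleneck, d))
out = residual_adapter(feat, w1, w2)
print(out.shape)  # (8,)
```

Only `w1` and `w2` would be trained in few-shot adaptation; the CLIP backbone that produced `feat` stays frozen, which is what keeps the method lightweight.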

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications [article]

Chao Zhang, Zichao Yang, Xiaodong He, Li Deng
2020 arXiv   pre-print
Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering.  ...  In this paper, we provide a technical review of available models and learning methods for multimodal intelligence.  ...  ACKNOWLEDGEMENT The authors are grateful to the editor and anonymous reviewers for their valuable suggestions that helped to make this paper better.  ... 
arXiv:1911.03977v3 fatcat:ojazuw3qzvfqrdweul6qdpxuo4

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [article]

Renrui Zhang, Longtian Qiu, Wei Zhang, Ziyao Zeng
2021 arXiv   pre-print
Specifically, we guide the text feature to adaptively explore informative regions on the image and aggregate the visual feature by a cross-attention mechanism.  ...  Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios.  ... 
arXiv:2112.02399v1 fatcat:pk7gjz5ewnfdvayab7ljiwpfli
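The cross-attention aggregation described in this entry — a text feature querying spatial image regions and pooling them by attention weight — can be sketched in a few lines. Dimensions, the scaling factor, and the absence of learned projections are simplifying assumptions, not VT-CLIP's exact design:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_guided_aggregate(text_feat, region_feats):
    """text_feat: (d,) query; region_feats: (n_regions, d) keys/values.
    Returns a (d,) attention-weighted sum of region features."""
    d = text_feat.shape[0]
    scores = region_feats @ text_feat / np.sqrt(d)  # one score per region
    attn = softmax(scores)                          # weights sum to 1
    return attn @ region_feats                      # aggregate regions

text = np.ones(4)
regions = np.eye(4)                                 # 4 toy one-hot regions
out = visual_guided_aggregate(text, regions)
print(out.shape)  # (4,)
```

In the full model the query, keys, and values would pass through learned linear projections; this sketch only shows why the pooled visual feature ends up weighted toward regions most similar to the text query.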

How Much Can CLIP Benefit Vision-and-Language Tasks? [article]

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
2021 arXiv   pre-print
image-caption pairs, has shown a strong zero-shot capability on various vision tasks.  ...  We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.  ...  Visual Question Answering The task of Visual Question Answering (VQA) (Antol et al., 2015) is to provide the answer given an image and a related question.  ... 
arXiv:2107.06383v1 fatcat:iwyntpju3fg6fcmbevlxdjch6m

TextCaps: a Dataset for Image Captioning with Reading Comprehension [article]

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
2020 arXiv   pre-print
We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension.  ...  To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.  ...  We would like to thank Guan Pang and Mandy Toh for helping us with OCR ground-truth collection. We would also like to thank Devi Parikh for helpful discussions and insights.  ... 
arXiv:2003.12462v2 fatcat:wjkzyqxa5vdktf2kjvp5wiglom

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [article]

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
2022 arXiv   pre-print
In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora.  ...  In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts, and align the visual and textual semantic spaces  ...  We evaluate UNIMO-2 on a variety of representative vision-language understanding and generation tasks, including image/text retrieval, visual question answering, visual reasoning and image captioning.  ... 
arXiv:2203.09067v1 fatcat:kjgqruympnegjoolcebopwhyim

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images [article]

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille
2015 arXiv   pre-print
In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions.  ...  Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images  ...  Acknowledgement We thank the anonymous reviewers for their comments and suggestions, and Xiaochen Lian for help in the dataset collection process.  ... 
arXiv:1504.06692v2 fatcat:yuzdpp5ylrcmfpytehy4dxoaqi
Showing results 1 — 15 out of 1,645 results