Multimodal Few-Shot Learning with Frozen Language Models
[article]
2021
arXiv
pre-print
We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring ...
Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). ...
Acknowledgements We wish to thank Sebastian Borgeaud and Jack Rae for preparing the pretraining text dataset and pretraining a selection of transformer language models, as well as Trevor Cai for help with ...
arXiv:2106.13884v2
fatcat:t5jhbtqgc5dn7gbzr2cxnpqjdq
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
[article]
2022
arXiv
pre-print
Furthermore, we analyze the effect of diverse prompts for few-shot tasks. ...
given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. ...
For FEWVLM, we use "question: [Q] answer <text_1>" (P3) as an input prompt and "<text_1> [A]" as a target prompt for visual question answering, and "an image of" (Q3) as an input prompt for captioning, ...
arXiv:2110.08484v2
fatcat:dsfqfdvlhbenpfkgjgqlyulaba
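The prompt templates quoted in the FewVLM entry above can be made concrete with a short sketch. This is an illustration only: the helper names are hypothetical, and "<text_1>" is assumed to be the sentinel token marking the span the model generates; it is not the authors' code.
```python
# Illustrative sketch of the FewVLM-style prompts quoted above (not the authors' code).
# "[Q]" and "[A]" stand for the question and answer; "<text_1>" is assumed to be the
# sentinel token for the answer span.
from typing import Optional, Tuple


def build_vqa_prompt(question: str, answer: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """P3 input prompt and its target prompt for visual question answering."""
    input_prompt = f"question: {question} answer <text_1>"
    target_prompt = f"<text_1> {answer}" if answer is not None else None
    return input_prompt, target_prompt


def build_caption_prompt() -> str:
    """Q3 input prompt for image captioning."""
    return "an image of"


if __name__ == "__main__":
    inp, tgt = build_vqa_prompt("what color is the cat?", "black")
    print(inp)                      # question: what color is the cat? answer <text_1>
    print(tgt)                      # <text_1> black
    print(build_caption_prompt())   # an image of
```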
MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning
[article]
2021
arXiv
pre-print
We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. ...
Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. ...
We use the same visual encoder (CLIP 'RN50x16') for all adapter ablations and evaluate the open-ended few-shot scores on the Visual Question Answering and Image Captioning tasks described in 4.1.1 and ...
arXiv:2112.05253v1
fatcat:2pjd6mfwrvbsxkzfdxbflwiy54
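To make the adapter-based finetuning mentioned in the MAGMA entry above concrete, here is a minimal sketch of a residual bottleneck adapter attached to the output of a frozen transformer block. It is a generic illustration under assumed dimensions, not MAGMA's actual architecture.
```python
# Minimal sketch of a residual bottleneck adapter of the kind used in adapter-based
# finetuning (generic illustration; hidden/bottleneck sizes are assumptions).
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, then add back to the input."""

    def __init__(self, hidden_dim: int = 2048, bottleneck_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# During finetuning the pretrained language model stays frozen; only the adapters
# (and the visual-prefix components) receive gradients.
lm_block_output = torch.randn(1, 16, 2048)   # stand-in for a frozen block's output
adapter = Adapter()
adapted = adapter(lm_block_output)           # same shape, trainable residual update
print(adapted.shape)                         # torch.Size([1, 16, 2048])
```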
Learning Compositional Representation for Few-shot Visual Question Answering
[article]
2021
arXiv
pre-print
Current methods of Visual Question Answering perform well on the answers with an amount of training data but have limited accuracy on the novel ones with few examples. ...
We generate the few-shot dataset of VQA with a variety of answers and their attributes without any human effort. ...
..., e.g. image captioning [1], visual grounding [2], visual question answering (VQA) [3], [4], [5], and visual dialog [6], [7]. ...
arXiv:2102.10575v1
fatcat:2ftyqx6sevew3ipuarjque6w44
Attributes as Semantic Units between Natural Language and Visual Recognition
[article]
2016
arXiv
pre-print
Specifically we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence description about images and video, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images. ...
Visual question answering Visual question answering is the problem of answering natural language questions about images, e.g. for the question "Where is the amber cat?" ...
arXiv:1604.03249v1
fatcat:a5dpwgoddvcsvkovik2gupbri4
Flamingo: a Visual Language Model for Few-Shot Learning
[article]
2022
arXiv
pre-print
These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. ...
Acknowledgements We would like to thank many of our colleagues for useful discussions, suggestions, feedback, and advice, including: Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente ...
arXiv:2204.14198v1
fatcat:5f4uhdmaibhm7cn3zetspjev3q
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[article]
2022
arXiv
pre-print
Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer. ...
However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction ...
Daruki, Nan Du and Aashi Jain for help with data preparation, Jonathan Shen, Colin Raffel and Sharan Narang for assistance on experimental settings, and others in the Google Brain team for support throughout ...
arXiv:2108.10904v3
fatcat:glozbeeytvdyvcgl7ersyz4i34
CoCa: Contrastive Captioners are Image-Text Foundation Models
[article]
2022
arXiv
pre-print
Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics ...
cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. ...
for help with visual illustration, Liangliang Cao for proofreading, and others in the Google Brain team for support throughout this project. ...
arXiv:2205.01917v1
fatcat:pxqshctdxnhx5ktqooigdttfrq
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
[article]
2021
arXiv
pre-print
To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification. ...
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by using large-scale contrastive image-text pairs. ...
Visual Question Answering [1, 2, 18, 33], Image Captioning [29, 67], and Referring Expression [68]. ...
arXiv:2111.03930v2
fatcat:ntojz5cn65eghfqvbmcgij4s2i
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
[article]
2020
arXiv
pre-print
Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. ...
In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. ...
ACKNOWLEDGEMENT The authors are grateful to the editor and anonymous reviewers for their valuable suggestions that helped to make this paper better. ...
arXiv:1911.03977v3
fatcat:ojazuw3qzvfqrdweul6qdpxuo4
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
[article]
2021
arXiv
pre-print
Specifically, we guide the text feature to adaptively explore informative regions on the image and aggregate the visual feature via a cross-attention mechanism. ...
Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. ...
The scale of training ... visual question answering. ...
arXiv:2112.02399v1
fatcat:pk7gjz5ewnfdvayab7ljiwpfli
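The zero-shot recognition described in the VT-CLIP entry above (aligning image and text embeddings from CLIP) can be sketched as follows. This assumes the openai/CLIP package, an arbitrary example image path ("cat.jpg"), and an illustrative label set; it is a plain CLIP zero-shot baseline, not the VT-CLIP method itself.
```python
# Sketch of CLIP zero-shot recognition: embed class-name prompts and an image,
# then classify by cosine similarity (assumes the openai/CLIP package).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

class_names = ["cat", "dog", "car"]   # illustrative open-vocabulary labels
texts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)   # unit-normalise both sides
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T              # scaled cosine similarities
    probs = logits.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```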
How Much Can CLIP Benefit Vision-and-Language Tasks?
[article]
2021
arXiv
pre-print
image-caption pairs, has shown a strong zero-shot capability on various vision tasks. ...
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. ...
Visual Question Answering The task of Visual Question Answering (VQA) (Antol et al., 2015) is to provide the answer given an image and a related question. ...
arXiv:2107.06383v1
fatcat:iwyntpju3fg6fcmbevlxdjch6m
TextCaps: a Dataset for Image Captioning with Reading Comprehension
[article]
2020
arXiv
pre-print
We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. ...
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. ...
We would like to thank Guan Pang and Mandy Toh for helping us with OCR ground-truth collection. We would also like to thank Devi Parikh for helpful discussions and insights. ...
arXiv:2003.12462v2
fatcat:wjkzyqxa5vdktf2kjvp5wiglom
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
[article]
2022
arXiv
pre-print
In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpus. ...
In particular, we propose to conduct grounded learning on both images and texts via a sharing grounded space, which helps bridge unaligned images and texts, and align the visual and textual semantic spaces ...
We evaluate UNIMO-2 on a variety of representative vision-language understanding and generation tasks, including image/text retrieval, visual question answering, visual reasoning, and image captioning. ...
arXiv:2203.09067v1
fatcat:kjgqruympnegjoolcebopwhyim
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
[article]
2015
arXiv
pre-print
In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. ...
Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images ...
Acknowledgement We thank the comments and suggestions of the anonymous reviewers, and help from Xiaochen Lian in the dataset collection process. ...
arXiv:1504.06692v2
fatcat:yuzdpp5ylrcmfpytehy4dxoaqi
Showing results 1 — 15 out of 1,645 results