28,688 Hits in 4.3 sec

Learning Visual Representations using Images with Captions

Ariadna Quattoni, Michael Collins, Trevor Darrell
2007 2007 IEEE Conference on Computer Vision and Pattern Recognition  
This paper describes a method for learning representations from large quantities of unlabeled images which have associated captions; the goal is to improve learning in future image classification problems  ...  images alone and (3) a model that uses the output of word classifiers trained using captions and unlabeled data.  ...  Shortly we will describe a method for constructing auxiliary training sets using images with captions. • The aim is to learn a representation of images, i.e., a function that maps images x to feature vectors  ... 
doi:10.1109/cvpr.2007.383173 dblp:conf/cvpr/QuattoniCD07 fatcat:6go2qequzze2bbmu3j5xrrjmu4

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations [article]

Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez
2021 arXiv   pre-print
Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used  ...  We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations.  ...  We use the COCO Captions dataset [9], which has 118K images, each with five captions.  ... 
arXiv:2112.07133v1 fatcat:6ikj4dz7fnhsvknbcc7cebxyd4
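The image-text feature alignment described in entries like the one above is commonly implemented as a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings. The following is a minimal NumPy sketch of that generic objective, not any of these papers' actual implementations; all names, shapes, and the temperature value are illustrative:

```python
import numpy as np

def info_nce_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_feats, txt_feats: (batch, dim) arrays; row i of each is a matched pair.
    Matched pairs are pulled together; all other pairings are pushed apart.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # diagonal entries are the positives

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In practice the two encoders are trained jointly so that matched pairs land on the diagonal of the similarity matrix.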

Learning Visual Representations with Caption Annotations [article]

Mert Bulent Sariyildiz, Julien Perez, Diane Larlus
2020 arXiv   pre-print
To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM) – a proxy task to learn visual representations over image-caption pairs.  ...  To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety  ...  .: Learning visual representations using images with captions. In: Proc. CVPR (2007) 2, 4, 5 50.  ... 
arXiv:2008.01392v1 fatcat:6hf54vnv4bht7ojbyszxbvtxwu

Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval

Albert Gordo, Diane Larlus
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
Following this observation, we learn a visual embedding of the images where the similarity in the visual space is correlated with their semantic similarity surrogate.  ...  We further extend our model to learn a joint embedding of visual and textual cues that allows one to query the database using a text modifier in addition to the query image, adapting the results to the  ...  This differs from our work, where the end task is to learn a visual embedding to retrieve images using a query image, and where the joint embedding is used to enrich the visual representation.  ... 
doi:10.1109/cvpr.2017.560 dblp:conf/cvpr/GordoL17 fatcat:2fyivt6avfccjjb32ua772dx24

Multimodal Contrastive Training for Visual Representation Learning [article]

Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, Baldo Faieta
2021 arXiv   pre-print
correlation simultaneously, hence improving the quality of learned visual representations.  ...  We first train our model on COCO and evaluate the learned visual representations on various downstream tasks including image classification, object detection, and instance segmentation.  ...  Our method shares the same spirit as these methods, in that we both use contrastive visual representation learning.  ... 
arXiv:2104.12836v1 fatcat:r5hpxd32vfct5jfk4seo5ztslm

Contrastive Learning for Weakly Supervised Phrase Grounding [article]

Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem
2020 arXiv   pre-print
We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words.  ...  Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions.  ...  Joint Image-Text Representation Learning.  ... 
arXiv:2006.09920v3 fatcat:2fecqopa3jdjlf5psmjykdj52q
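The word-region attention in the snippet above can be illustrated with a generic sketch: each caption word attends over image region features, and the caption's compatibility with the image averages the similarity between each word and its attended region. This is a toy version under invented shapes and names, not the paper's model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_region_compatibility(word_feats, region_feats):
    """Score a caption against an image via word-region attention.

    word_feats:   (num_words, dim)   caption word embeddings
    region_feats: (num_regions, dim) image region features
    Each word attends over regions; its score is its similarity to the
    attention-weighted region. The caption score averages over words.
    """
    sims = word_feats @ region_feats.T                 # (words, regions)
    attn = softmax(sims, axis=1)                       # attention over regions
    attended = attn @ region_feats                     # (words, dim)
    word_scores = (word_feats * attended).sum(axis=1)  # per-word compatibility
    return word_scores.mean()
```

In a weakly supervised setting, such compatibility scores for matching image-caption pairs are contrasted against non-matching pairs, maximizing a lower bound on the mutual information between images and caption words.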

End-to-end Image Captioning Exploits Multimodal Distributional Similarity [article]

Pranava Madhyastha, Josiah Wang, Lucia Specia
2018 arXiv   pre-print
We conclude that, regardless of the image representation used, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.  ...  representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together.  ...  We visualize the initial visual subspace and the learned joint visual semantic subspace and observe that the visual semantic subspace has learned to cluster images with similar visual and linguistic information  ... 
arXiv:1809.04144v1 fatcat:3odhf3xtcfeq7obx57b766rwhm

VirTex: Learning Visual Representations from Textual Annotations [article]

Karan Desai, Justin Johnson
2021 arXiv   pre-print
We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations.  ...  Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images.  ...  We thank Jia Deng for access to extra GPUs during project development; and UMich ARC-TS team for support with GPU cluster management.  ... 
arXiv:2006.06666v3 fatcat:ifck6jbayvc4hcrznk6icqghga

Federated Learning for Vision-and-Language Grounding Problems

Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
...  image captioning and visual question answering (VQA), has attracted extensive interest from both the academic and industrial worlds.  ...  fine-grained image representations.  ...  To equip baseline models with our aimNet, i.e., to use the fine-grained image representations learned by aimNet in baseline models, we replace the original features with the refined features directly  ... 
doi:10.1609/aaai.v34i07.6824 fatcat:uz72aopvavgnxcnzpxc4nliwh4

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment [article]

Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran
2019 arXiv   pre-print
In a subsequent step, this (learned) representation is aligned with the caption.  ...  Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs.  ...  Module takes the caption-conditioned representation of the image and learns to align it with the caption using a ranking loss.  ... 
arXiv:1903.11649v2 fatcat:nm3wbg2swbgszdnnfalry3oaeu

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [article]

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
2019 arXiv   pre-print
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.  ...  answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture.  ...  We use the Adam optimizer with an initial learning rate of 1e-4. We use a linear decay learning rate schedule with warm-up to train the model. Both training task losses are weighted equally.  ... 
arXiv:1908.02265v1 fatcat:6qlqpknrcnf5lmhe27t7jht5ca
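The linear-decay-with-warmup schedule mentioned in the snippet above is a standard pretraining recipe. A generic sketch follows; the parameter names and the step counts in the usage note are illustrative, not ViLBERT's actual configuration:

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=1e-4):
    """Learning rate with linear warmup followed by linear decay to zero.

    Ramps from 0 to base_lr over warmup_steps, then decays linearly
    so the rate reaches 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For instance, with total_steps=100 and warmup_steps=10, the rate reaches base_lr at step 10 and falls back to zero at step 100.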

Improving Image Captioning with Better Use of Captions [article]

Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu
2020 arXiv   pre-print
The representation is then enhanced with neighbouring and contextual nodes with their textual and visual features.  ...  Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.  ...  These nodes can be viewed as image representation units used for generation.  ... 
arXiv:2006.11807v1 fatcat:v44y6ubghncz7pkklt2waumsky

Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset [article]

Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre
2018 arXiv   pre-print
The dataset comprises images along with their respective textual captions.  ...  In this paper we introduce vSTS, a new dataset for measuring textual similarity of sentences using multimodal information.  ...  [9] provides an approach that learns to align images with descriptions.  ... 
arXiv:1809.03695v1 fatcat:6rbqiiupmfekfcadtcq3cvv3su

Learning language through pictures

Grzegorz Chrupała, Ákos Kádár, Afra Alishahi
2015 Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)  
We propose IMAGINET, a model of learning visually grounded representations of language from coupled textual and visual input.  ...  Mimicking an important aspect of human language learning, it acquires meaning representations for individual words from descriptions of visual scenes.  ...  ., 2013) for learning word representations via a multi-task objective similar to ours, learning from a dataset where some words are individually aligned with corresponding images.  ... 
doi:10.3115/v1/p15-2019 dblp:conf/acl/ChrupalaKA15 fatcat:zz3lhjxiejfjrjpbjx3gfnmrnq

Evaluating Multimodal Representations on Visual Semantic Textual Similarity [article]

Oier Lopez de Lacalle, Ander Salaberria, Aitor Soroa, Gorka Azkune, Eneko Agirre
2020 arXiv   pre-print
The combination of visual and textual representations has produced excellent results in tasks such as image captioning and visual question answering, but the inference capabilities of multimodal representations  ...  Our experiments using simple multimodal representations show that the addition of image representations produces better inference, compared to text-only representations.  ...  For image representation we use a model pre-trained on ImageNet (ResNet [17]). To combine visual and textual representations we use concatenation and learn simple projections.  ... 
arXiv:2004.01894v1 fatcat:tuoemo6mrrbwxhbubbon6lt5ny
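The fusion strategy this last snippet describes (concatenating visual and textual vectors, then learning a simple projection) can be sketched generically. The weight and bias would be learned end to end in practice; all names and dimensions here are illustrative:

```python
import numpy as np

def concat_and_project(text_emb, image_emb, weight, bias):
    """Fuse text and image vectors via concatenation + a linear projection.

    text_emb:  (text_dim,) sentence embedding
    image_emb: (img_dim,)  image embedding (e.g. from a ResNet, per the snippet)
    weight:    (out_dim, text_dim + img_dim) projection matrix (learned in practice)
    bias:      (out_dim,)  projection bias
    """
    fused = np.concatenate([text_emb, image_emb])  # (text_dim + img_dim,)
    return weight @ fused + bias                   # (out_dim,)
```

The projected vector can then feed a similarity regressor or classifier for the downstream task.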
Showing results 1 — 15 out of 28,688 results