343 Hits in 6.3 sec

Oracle performance for visual captioning [article]

Li Yao, Nicolas Ballas, Kyunghyun Cho, John R. Smith, Yoshua Bengio
2016 arXiv   pre-print
Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount  ...  Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or  ...  Acknowledgments The authors would like to acknowledge the support of the following agencies for research funding and computing support: IBM T.J.  ... 
arXiv:1511.04590v5 fatcat:6tppv44nvvfvta2ccj3ub4pmii

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer [article]

Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi
2022 arXiv   pre-print
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems.  ...  Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in  ...  Acknowledgements We would like to thank the AI2 Mosaic team for discussions, the AI2 Beaker team for computing support, and the anonymous reviewers for their suggestions.  ... 
arXiv:2112.08995v2 fatcat:xxzx6dvfiffcrnfntsvxjilhfy
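The pivot idea described in the entry above can be illustrated with a short sketch: an audio encoder is trained contrastively against a frozen image tower that is shared with a pretrained image-text model, so audio and text end up in the same embedding space without any parallel audio-text data. The encoders, feature dimensions, and loss below are stand-ins for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: the image tower is assumed to be a frozen encoder shared with a pretrained
# image-text model; only the audio tower is trained. Dimensions are illustrative.
image_encoder = nn.Linear(2048, 512)                                    # precomputed frame features -> joint space
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 1000, 512))  # log-mel spectrogram input
for p in image_encoder.parameters():
    p.requires_grad = False                                             # frozen pivot

def symmetric_infonce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# One step on (frame, audio) pairs mined from videos -- no parallel audio-text data needed.
frame_feats = torch.randn(8, 2048)
audio_clips = torch.randn(8, 1, 64, 1000)
loss = symmetric_infonce(audio_encoder(audio_clips), image_encoder(frame_feats))
loss.backward()

# Because text embeddings of the pretrained image-text model live in the same space,
# audio-text similarity can now be computed directly via cosine similarity.
```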

Top-Down Visual Saliency Guided by Captions

Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
Evaluation on large-scale video and image datasets demonstrates that our approach achieves comparable captioning performance with existing methods while providing more accurate saliency heatmaps.  ...  Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.  ...  We thank Subhashini Venugopalan for providing an implementation of S2VT [22] and Stan Sclaroff for many useful discussions.  ... 
doi:10.1109/cvpr.2017.334 dblp:conf/cvpr/RamanishkaDZS17 fatcat:w2kvs4nttjfljealnrcehxwg3q
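One generic way to obtain such caption-guided saliency maps, in the spirit of the entry above though not necessarily the authors' exact LSTM formulation, is an occlusion-style probe: suppress one region descriptor at a time and measure how much the probability of a given caption word drops. The `captioner` callable and its output shape below are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def word_saliency(captioner, regions, caption_ids, word_pos):
    """
    Saliency of each region descriptor for one caption word, estimated by suppressing
    one region at a time and measuring the drop in the word's log-probability.
    `captioner(regions, caption_ids)` is assumed to return per-step logits of shape (T, vocab_size).
    """
    with torch.no_grad():
        base = captioner(regions, caption_ids)                           # (T, V)
        base_lp = F.log_softmax(base[word_pos], dim=-1)[caption_ids[word_pos]]
        scores = []
        for r in range(regions.size(0)):
            masked = regions.clone()
            masked[r] = 0.0                                              # suppress one region
            logits = captioner(masked, caption_ids)
            lp = F.log_softmax(logits[word_pos], dim=-1)[caption_ids[word_pos]]
            scores.append((base_lp - lp).clamp(min=0.0))                 # bigger drop = more salient
    sal = torch.stack(scores)
    return sal / (sal.sum() + 1e-8)                                      # normalized heatmap weights
```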

Top-down Visual Saliency Guided by Captions [article]

Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
2017 arXiv   pre-print
Evaluation on large-scale video and image datasets demonstrates that our approach achieves comparable captioning performance with existing methods while providing more accurate saliency heatmaps.  ...  Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.  ...  We thank Subhashini Venugopalan for providing an implementation of S2VT [22] and Stan Sclaroff for many useful discussions.  ... 
arXiv:1612.07360v2 fatcat:7eehtb2mbbg6vcz2aldgxqd2du

Improving Generation and Evaluation of Visual Stories via Semantic Consistency [article]

Adyasha Maharana, Darryl Hannan, Mohit Bansal
2021 arXiv   pre-print
However, there is room for improvement of generated images in terms of visual quality, coherence and relevance.  ...  story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames.  ...  Acknowledgments We thank Peter Hase, Jaemin Cho, Hyounghun Kim, and the reviewers for their useful feedback.  ... 
arXiv:2105.10026v1 fatcat:uouyfjbhgjalboi5duyv5pg7eu
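The role of the video captioning model mentioned above can be sketched as a simple cyclic-consistency check: caption the generated images and compare the captions with the input story. The helper names (`generate`, `caption`, `text_sim`) are hypothetical placeholders, not the paper's code.

```python
def cyclic_consistency_score(story_sentences, generate, caption, text_sim):
    """Caption the generated images and compare them against the source story sentences."""
    images = generate(story_sentences)                    # one image per sentence
    recaptioned = [caption(img) for img in images]
    scores = [text_sim(s, c) for s, c in zip(story_sentences, recaptioned)]
    return sum(scores) / len(scores)

# Degenerate but runnable demo: identity "generator"/"captioner" and word overlap as similarity.
story = ["a dog runs in the park", "the dog catches a ball"]
jaccard = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
score = cyclic_consistency_score(story, generate=lambda s: s, caption=lambda x: x, text_sim=jaccard)
```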

Look, Read and Enrich. Learning from Scientific Figures and their Captions [article]

Jose Manuel Gomez-Perez, Raul Ortega
2019 arXiv   pre-print
Compared to natural images, understanding scientific figures is particularly hard for machines.  ...  In this paper we investigate what can be learnt by looking at a large number of figures and reading their captions, and introduce a figure-caption correspondence learning task that makes use of our observations  ...  Acknowledgments The research reported in this paper is supported by the EU Horizon 2020 programme, under grants European Language Grid-825627 and Co-inform-770302.  ... 
arXiv:1909.09070v1 fatcat:zhlscpvkmneylinylv7c5xjcji
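In its simplest form, a figure-caption correspondence task like the one described above can be posed as binary classification of matched vs. mismatched (figure, caption) pairs, with negatives obtained by shuffling captions within a batch. The sketch below uses stand-in encoders and only illustrates the task setup, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceModel(nn.Module):
    """Two towers plus a small classifier deciding whether a figure and a caption match."""
    def __init__(self, dim=256, vocab=10000):
        super().__init__()
        self.fig_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))
        self.txt_enc = nn.Sequential(nn.EmbeddingBag(vocab, dim), nn.Linear(dim, dim))
        self.cls = nn.Linear(2 * dim, 1)

    def forward(self, figures, captions):
        f = self.fig_enc(figures)
        t = self.txt_enc(captions)
        return self.cls(torch.cat([f, t], dim=-1)).squeeze(-1)          # match logit

model = CorrespondenceModel()
figs = torch.randn(8, 3, 224, 224)
caps = torch.randint(0, 10000, (8, 20))                                 # caption token ids
pos = model(figs, caps)                                                 # aligned pairs
neg = model(figs, caps[torch.randperm(8)])                              # shuffled captions as (approximate) negatives
loss = F.binary_cross_entropy_with_logits(
    torch.cat([pos, neg]), torch.cat([torch.ones(8), torch.zeros(8)]))
loss.backward()
```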

SoDeep: a Sorting Deep net to learn ranking loss surrogates [article]

Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord
2019 arXiv   pre-print
We demonstrate the value of our approach on three different tasks that require ranking: cross-modal text-image retrieval, multi-label image classification and visual memorability ranking.  ...  It is trained virtually for free using synthetic data. This sorting deep (SoDeep) net can then be combined in a plug-and-play manner with existing deep architectures.  ...  caption retrieval, and by (0.3%, 0.1%, 0.3%) for image retrieval.  ... 
arXiv:1904.04272v1 fatcat:sr6cgy7ouvgxnid67nku3y5voq
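The core mechanism, a small network trained on synthetic data to approximate the non-differentiable score-to-rank mapping and then frozen and reused as a ranking-loss surrogate, can be sketched as follows; the sorter architecture and the surrogate loss shown here are illustrative choices, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N = 16  # length of the score vectors to be ranked

# A small sorter mapping a vector of raw scores to (approximate) normalized ranks.
sorter = nn.Sequential(nn.Linear(N, 128), nn.ReLU(), nn.Linear(128, N))

def true_normalized_ranks(scores):
    # Rank of each element (0 = smallest), scaled to [0, 1]; non-differentiable.
    return scores.argsort(dim=-1).argsort(dim=-1).float() / (scores.size(-1) - 1)

# 1) Train the sorter "virtually for free" on synthetic score vectors.
opt = torch.optim.Adam(sorter.parameters(), lr=1e-3)
for _ in range(1000):
    scores = torch.randn(64, N)
    loss = F.l1_loss(sorter(scores), true_normalized_ranks(scores))
    opt.zero_grad()
    loss.backward()
    opt.step()

# 2) Freeze it and plug it in as a differentiable rank-based loss for another model.
for p in sorter.parameters():
    p.requires_grad = False

def rank_surrogate_loss(pred_scores, target_scores):
    """Differentiable stand-in for a Spearman-style ranking objective."""
    return F.mse_loss(sorter(pred_scores), true_normalized_ranks(target_scores))
```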

SoDeep: A Sorting Deep Net to Learn Ranking Loss Surrogates

Martin Engilberge, Louis Chevallier, Patrick Perez, Matthieu Cord
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We demonstrate the value of our approach on three different tasks that require ranking: cross-modal text-image retrieval, multilabel image classification and visual memorability ranking.  ...  It is trained virtually for free using synthetic data. This sorting deep (SoDeep) net can then be combined in a plug-and-play manner with existing deep architectures.  ...  caption retrieval, and by (0.3%, 0.1%, 0.3%) for image retrieval.  ... 
doi:10.1109/cvpr.2019.01105 dblp:conf/cvpr/EngilbergeCPC19 fatcat:5z732nofr5etheedodh7at5o6u

Multimodal feature fusion based on object relation for video captioning

Zhiwen Yan, Ying Chen, Jinlong Song, Jia Zhu
2022 CAAI Transactions on Intelligence Technology  
However, most existing video captioning methods ignore the relationships between objects in the video and the correlations between multimodal features, and they also ignore the effect  ...  Video captioning aims at automatically generating a natural language caption to describe the content of a video.  ...  ACKNOWLEDGMENTS This work was supported by the National Natural Science Foundation of China under Grant 62077015 and the Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province  ... 
doi:10.1049/cit2.12071 fatcat:bmyvu6sr6zbqtac6jxfqiygutu
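A generic attention-based fusion block in the spirit of the entry above: detected-object features query concatenated appearance and motion features, so that cross-modal correlations are modelled before caption decoding. All module names and dimensions below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ObjectGuidedFusion(nn.Module):
    """Objects attend over appearance and motion features; the result is fused per object."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, obj_feats, app_feats, motion_feats):
        context = torch.cat([app_feats, motion_feats], dim=1)           # (B, Ta+Tm, D)
        attended, _ = self.attn(obj_feats, context, context)            # objects query the context
        return self.proj(torch.cat([obj_feats, attended], dim=-1))      # fused per-object features

fusion = ObjectGuidedFusion()
objs = torch.randn(2, 10, 512)       # detected-object features
app = torch.randn(2, 20, 512)        # frame appearance features
mot = torch.randn(2, 20, 512)        # motion (e.g. 3D-CNN) features
fused = fusion(objs, app, mot)       # would feed a caption decoder downstream
```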

Distributed Attention for Grounded Image Captioning [article]

Nenglun Chen, Xingjia Pan, Runnan Chen, Lei Yang, Zhiwen Lin, Yuqiang Ren, Haolei Yuan, Xiaowei Guo, Feiyue Huang, Wenping Wang
2021 arXiv   pre-print
We study the problem of weakly supervised grounded image captioning.  ...  One main issue that has been ignored is that the attention for generating visually groundable words may focus only on the most discriminative parts and cannot cover the whole object.  ...  In addition, it also contains 275 bounding box annotations, and each bounding box corresponds to a visually groundable noun phrase in the caption.  ... 
arXiv:2108.01056v1 fatcat:bflaowpm2ng5dhd2k5ms2kfm7e

Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations [article]

Dan Oneata, Horia Cucu
2022 arXiv   pre-print
In this work, we investigate ways of improving the base speech recognition system by following similar techniques to the ones used for the visual encoder, namely, transferring representations and data  ...  This technique replaces previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting.  ...  We thank the anonymous reviewers and Desmond Elliott for useful suggestions.  ... 
arXiv:2204.13206v1 fatcat:j6frzfcc5rgprbb7dizl6fgbxu

Wav2CLIP: Learning Robust Audio Representations From CLIP [article]

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello
2022 arXiv   pre-print
Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open-sourced and made available for further applications.  ...  Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval.  ...  We also include, to the best of our knowledge, SOTA results from the literature as toplines, i.e., models trained specifically for each task/dataset, as an upper bound for comparison with our approach.  ... 
arXiv:2110.11499v2 fatcat:uq6dxnke6ne5nissu6yividmsu
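Once audio is projected into a CLIP-style shared space, zero-shot classification reduces to cosine similarity between the audio embedding and text embeddings of class prompts. The sketch below assumes hypothetical embedders that already produce vectors in that shared space; the prompt template is an illustrative choice, not the released Wav2CLIP code.

```python
import torch
import torch.nn.functional as F

def zero_shot_audio_classification(audio_emb, class_names, embed_text):
    """Score an audio clip against class names via cosine similarity in the shared space."""
    text_embs = torch.stack([embed_text(f"the sound of a {c}") for c in class_names])
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    probs = (audio_emb @ text_embs.t()).softmax(dim=-1)     # (1, num_classes)
    return dict(zip(class_names, probs.squeeze(0).tolist()))

# Usage with dummy embedders (a 512-d shared space is assumed):
embed_text = lambda s: torch.randn(512)
audio_emb = torch.randn(1, 512)
print(zero_shot_audio_classification(audio_emb, ["dog", "siren", "rain"], embed_text))
```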

Sequential Person Recognition in Photo Albums with a Recurrent Network [article]

Yao Li, Guosheng Lin, Bohan Zhuang, Lingqiao Liu, Chunhua Shen, Anton van den Hengel
2016 arXiv   pre-print
In this sense, our approach is a unified framework for modeling both contextual cues and visual appearance of person instances.  ...  Recognizing the identities of people in everyday photos is still a very challenging problem for machine vision, due to factors such as non-frontal faces and changes in clothing, location, and lighting.  ...  For instance, LSTMs have been widely used in vision-to-language problems, such as image captioning [22, 27], video description [6, 20], and visual question answering [28, 26, 25].  ... 
arXiv:1611.09967v1 fatcat:aidtmunlgrcivoa5qmyu6dhzga
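A minimal sketch of the recurrent formulation described above: an LSTM reads per-instance appearance features in photo order, so each identity prediction can use the context of previously seen people. The feature dimension, hidden size, and classifier head are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SequentialPersonRecognizer(nn.Module):
    """Contextual cues (previous instances) and appearance are modelled jointly by an LSTM."""
    def __init__(self, feat_dim=2048, hidden=512, num_identities=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_identities)

    def forward(self, instance_feats):          # (B, T, feat_dim): T person instances per album
        h, _ = self.lstm(instance_feats)
        return self.head(h)                     # (B, T, num_identities) identity logits per instance

model = SequentialPersonRecognizer()
feats = torch.randn(2, 5, 2048)                 # e.g. CNN features of 5 detected people
logits = model(feats)
```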

Sequential Person Recognition in Photo Albums with a Recurrent Network

Yao Li, Guosheng Lin, Bohan Zhuang, Lingqiao Liu, Chunhua Shen, Anton van den Hengel
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
In this sense, our approach is a unified framework for modeling both contextual cues and visual appearance of person instances.  ...  Recognizing the identities of people in everyday photos is still a very challenging problem for machine vision, due to issues such as non-frontal faces, changes in clothing, location and lighting.  ...  For instance, LSTMs have been widely used in machine translation [18] and vision-to-language problems, such as image captioning [22, 27], video description [6, 20], and visual question answering  ... 
doi:10.1109/cvpr.2017.600 dblp:conf/cvpr/LiLZLSH17 fatcat:6xhbjwnrybh4voi74r7wpfyomy

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations [article]

Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez
2021 arXiv   pre-print
We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets.  ...  CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding.  ...  In [45], video representations are learned using paired textual metadata; however, the method does not extend to visual pretraining for images.  ... 
arXiv:2112.07133v1 fatcat:6ikj4dz7fnhsvknbcc7cebxyd4
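Image-text retrieval with a two-tower model of this kind is typically scored with Recall@k over cosine similarities between the image and text embeddings; a minimal sketch of that evaluation (with random embeddings as stand-ins) follows.

```python
import torch
import torch.nn.functional as F

def retrieval_recall_at_k(image_embs, text_embs, k=5):
    """Image-to-text Recall@k for aligned (image_i, text_i) pairs, using cosine similarity.
    Both inputs are (N, D) and assumed to come from the two towers of a CLIP-style model."""
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = image_embs @ text_embs.t()                      # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices                    # indices of the k best captions per image
    targets = torch.arange(image_embs.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Example with random embeddings:
print(retrieval_recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))
```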
Showing results 1 — 15 out of 343 results