Clue: Cross-modal Coherence Modeling for Caption Generation [article]

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone
2020 arXiv   pre-print
step, and also train coherence-aware, controllable image captioning models.  ...  We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning.  ...  Thanks to Gabriel Greenberg and the anonymous reviewers for helpful comments. We would also like to thank the Mechanical Turk annotators for their contributions.  ... 
arXiv:2005.00908v1 fatcat:5rx2eeacufcc7oi3y5mac46xq4

Cross-modal Coherence Modeling for Caption Generation

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone
2020 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics   unpublished
step, and also train coherence-aware, controllable image captioning models.  ...  We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning.  ...  Thanks to Gabriel Greenberg and the anonymous reviewers for helpful comments. We would also like to thank the Mechanical Turk annotators for their contributions.  ... 
doi:10.18653/v1/2020.acl-main.583 fatcat:3qqthgtuw5hbtdmpbm3jhq4kny

Multimedia content analysis-using both audio and visual clues

Yao Wang, Zhu Liu, Jin-Cheng Huang
2000 IEEE Signal Processing Magazine  
Acknowledgments We would like to thank Howard Wactlar for providing information regarding CMU's Informedia project, Qian Huang for reviewing information regarding AT&T's Pictorial Transcript project, Rainer  ...  A CCV is a collection of coherence pairs, which are numbers of coherent and incoherent pixels, for each quantized color.  ...  Generation Audio Model Based Anchor Detection Integrated Audio/Visual Anchor Detection Model Based Anchor Detection On-Line Anchor Visual Model Theme Music Detection Anchor Segments  ... 
doi:10.1109/79.888862 fatcat:lxquhqnvxbduthix4zg52v32ca

Cross-Modal Coherence for Text-to-Image Retrieval [article]

Malihe Alikhani, Fangda Han, Hareesh Ravi, Mubbasir Kapadia, Vladimir Pavlovic, Matthew Stone
2022 arXiv   pre-print
In this paper, we train a Cross-Modal Coherence Model for text-to-image retrieval task.  ...  Our analysis shows that models trained with image--text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models.  ...  The research presented in this paper has been supported by NSF awards IIS-1703883, IIS-1955404, IIS-1955365, IIS 1955404, RETTL-2119265, IIS-1526723, CCF-1934924, and EAGER-2122119, and through generous  ... 
arXiv:2109.11047v2 fatcat:t6q2hkw64rhxvpdkauzyngcpw4

Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training [article]

Bei Liu, Jianlong Fu, Makoto P. Kato, Masatoshi Yoshikawa
2018 arXiv   pre-print
To solve the above challenges, we formulate the task of poem generation into two correlated sub-tasks by multi-adversarial training via policy gradient, through which the cross-modal relevance and poetic  ...  Two discriminative networks are further introduced to guide the poem generation, including a multi-modal discriminator and a poem-style discriminator.  ...  tasks. • We incorporate a deep coupled visual-poetic embedding model and a RNN-based generator for joint learning, in which two discriminators provide rewards for measuring cross-modality relevance and  ... 
arXiv:1804.08473v3 fatcat:v3xkb3gf6nay3m2xsciaww7b5u

Improving Visual Question Answering by Referring to Generated Paragraph Captions

Hyounghun Kim, Mohit Bansal
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions.  ...  These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering.  ...  Acknowledgments We thank the reviewers for their helpful comments.  ... 
doi:10.18653/v1/p19-1351 dblp:conf/acl/KimB19 fatcat:47smsygpcrdghfzu2jgsaiefaa

Improvement of Commercial Boundary Detection Using Audiovisual Features [chapter]

Jun-Cheng Chen, Jen-Hao Yeh, Wei-Ta Chu, Jin-Hau Kuo, Ja-Ling Wu
2005 Lecture Notes in Computer Science  
According to the clues from speech-music discrimination, video scene detection, and caption detection, a multi-modal commercial detection scheme is proposed.  ...  Detection of commercials in TV videos is difficult because the diversity of them puts up a high barrier to construct an appropriate model.  ...  The coherence among shots in the attention span, T as , is computed. Fig.3 illustrates the ideas of video coherence and memory model.  ... 
doi:10.1007/11581772_68 fatcat:hvb37c5fsfaudnrzgtuvp65gxi

Deep Multimodal Learning for Affective Analysis and Retrieval

Lei Pang, Shiai Zhu, Chong-Wah Ngo
2015 IEEE transactions on multimedia  
More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieval of videos using the text query "crazy cat".  ...  While the model learns a joint representation over multimodal inputs, training samples in absence of certain modalities can also be leveraged.  ...  For example, the text captions in Fig. 1 provide clue connecting two visually dissimilar roller coasters.  ... 
doi:10.1109/tmm.2015.2482228 fatcat:7tozmatnhvbj7hjjohkofngecq

Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning [article]

Shikha Dubey, Farrukh Olimov, Muhammad Aasim Rafique, Joonmo Kim, Moongu Jeon
2021 arXiv   pre-print
The proposed technique acquires a proposal of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using a label-attention module  ...  Image captioning is a nomenclature for describing meaningful information in an image using computer vision techniques.  ...  The visual clues are accumulated and processed with label attention to comprehend the language semantics with a decoder module in LATGeO and generates meaningful image captions.  ... 
arXiv:2109.07799v1 fatcat:3f7tef6oy5fcznv2hkearagygq

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation.  ...  In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities.  ...  [273] proposed using multi-view LSTM for modeling view-specific and cross-view interactions over time to generate robust latent codes for image captioning and multimodal behavior recognition by directly  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain [article]

Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, Ling Shao
2021 arXiv   pre-print
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers.  ...  ., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color for self-supervised VL pre-training at patches of different scale.  ...  As such, these techniques will benefit for general cross-modality representation learning. [8]. VG = Visual Genome [35]. CC = Conceptual Caption [59]. SBU = SBU Captions [53].  ... 
arXiv:2103.16110v3 fatcat:i3meybz6bzea7ebfdut4bkycse

A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities (i.e., text and video).  ...  ) to be used in TSGV, and iii) in-depth discusses potential problems of current benchmarking designs and research directions for further investigations.  ...  It builds a joint graph for modelling the cross-/self-modal relations via iterative message passing, to capture the high-order interactions between two modalities effectively.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu

ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization [article]

Zijian Zhang, Chang Shu, Youxin Chen, Jing Xiao, Qian Zhang, Lu Zheng
2022 arXiv   pre-print
In addition, missing awareness of cross-modal matching from many frameworks leads to performance reduction.  ...  Integrating multimodal knowledge for abstractive summarization task is a work-in-progress research area, with present techniques inheriting fusion-then-generation paradigm.  ...  The iterative alignment approach may gradually update cross-modal attention in order to amass clues for identifying matching semantics and improving cross-modal information interaction.  ... 
arXiv:2108.05123v2 fatcat:b24tgqqpnzh4lhimfvavu2n4n4

Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency [article]

Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, Ralph Ewerth
2020 arXiv   pre-print
Several measures are suggested to calculate cross-modal similarity for these entities using state of the art approaches.  ...  In this paper, we introduce a novel task of cross-modal consistency verification in real-world news and present a multimodal approach to quantify the entity coherence between image and text.  ...  We also want to gratefully thank Avishek Anand (L3S Research Center, Leibniz Universität Hannover) for his valuable comments to improve the paper.  ... 
arXiv:2003.10421v2 fatcat:bqrn4xzc4zhovohco7wwx74xby

A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries

Yashaswi Verma, C.V. Jawahar
2015 Proceedings of the 23rd ACM international conference on Multimedia - MM '15  
We present a probabilistic approach that seamlessly integrates visual and textual information for the task.  ...  In particular, we focus on descriptive queries that can be either in the form of simple captions (e.g., "a brown cat sleeping on a sofa"), or even long descriptions with multiple sentences.  ...  Both these methods have been shown to perform well for image retrieval using descriptive queries, and cross-modal retrieval in general.  ... 
doi:10.1145/2733373.2806289 dblp:conf/mm/VermaJ15 fatcat:pgrdsvtaqzb77b7fxuoiqxtlda