Clue: Cross-modal Coherence Modeling for Caption Generation
[article]
2020
arXiv
pre-print
step, and also train coherence-aware, controllable image captioning models. ...
We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. ...
Thanks to Gabriel Greenberg and the anonymous reviewers for helpful comments. We would also like to thank the Mechanical Turk annotators for their contributions. ...
arXiv:2005.00908v1
fatcat:5rx2eeacufcc7oi3y5mac46xq4
Cross-modal Coherence Modeling for Caption Generation
2020
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
unpublished
step, and also train coherence-aware, controllable image captioning models. ...
We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. ...
Thanks to Gabriel Greenberg and the anonymous reviewers for helpful comments. We would also like to thank the Mechanical Turk annotators for their contributions. ...
doi:10.18653/v1/2020.acl-main.583
fatcat:3qqthgtuw5hbtdmpbm3jhq4kny
Multimedia content analysis-using both audio and visual clues
2000
IEEE Signal Processing Magazine
Acknowledgments We would like to thank Howard Wactlar for providing information regarding CMU's Informedia project, Qian Huang for reviewing information regarding AT&T's Pictorial Transcript project, Rainer ...
A CCV is a collection of coherence pairs, which are numbers of coherent and incoherent pixels, for each quantized color. ...
Figure: anchor detection pipeline (audio model based, visual model based, and integrated audio/visual anchor detection; on-line anchor model generation; theme music detection; anchor segments). ...
doi:10.1109/79.888862
fatcat:lxquhqnvxbduthix4zg52v32ca
Cross-Modal Coherence for Text-to-Image Retrieval
[article]
2022
arXiv
pre-print
In this paper, we train a Cross-Modal Coherence Model for the text-to-image retrieval task. ...
Our analysis shows that models trained with image--text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models. ...
The research presented in this paper has been supported by NSF awards IIS-1703883, IIS-1955404, IIS-1955365, RETTL-2119265, IIS-1526723, CCF-1934924, and EAGER-2122119, and through generous ...
arXiv:2109.11047v2
fatcat:t6q2hkw64rhxvpdkauzyngcpw4
Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training
[article]
2018
arXiv
pre-print
To solve the above challenges, we formulate the task of poem generation into two correlated sub-tasks by multi-adversarial training via policy gradient, through which the cross-modal relevance and poetic ...
Two discriminative networks are further introduced to guide the poem generation, including a multi-modal discriminator and a poem-style discriminator. ...
tasks. • We incorporate a deep coupled visual-poetic embedding model and an RNN-based generator for joint learning, in which two discriminators provide rewards for measuring cross-modality relevance and ...
arXiv:1804.08473v3
fatcat:v3xkb3gf6nay3m2xsciaww7b5u
Improving Visual Question Answering by Referring to Generated Paragraph Captions
2019
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. ...
These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. ...
Acknowledgments We thank the reviewers for their helpful comments. ...
doi:10.18653/v1/p19-1351
dblp:conf/acl/KimB19
fatcat:47smsygpcrdghfzu2jgsaiefaa
Improvement of Commercial Boundary Detection Using Audiovisual Features
[chapter]
2005
Lecture Notes in Computer Science
According to the clues from speech-music discrimination, video scene detection, and caption detection, a multi-modal commercial detection scheme is proposed. ...
Detecting commercials in TV videos is difficult because their diversity makes it hard to construct an appropriate model. ...
The coherence among shots in the attention span, T_as, is computed. Fig. 3 illustrates the ideas of video coherence and the memory model. ...
doi:10.1007/11581772_68
fatcat:hvb37c5fsfaudnrzgtuvp65gxi
Deep Multimodal Learning for Affective Analysis and Retrieval
2015
IEEE transactions on multimedia
More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieval of videos using the text query "crazy cat". ...
While the model learns a joint representation over multimodal inputs, training samples in absence of certain modalities can also be leveraged. ...
For example, the text captions in Fig. 1 provide a clue connecting two visually dissimilar roller coasters. ...
doi:10.1109/tmm.2015.2482228
fatcat:7tozmatnhvbj7hjjohkofngecq
Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning
[article]
2021
arXiv
pre-print
The proposed technique acquires a proposal of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using a label-attention module ...
Image captioning is the task of describing meaningful information in an image using computer vision techniques. ...
The visual clues are accumulated and processed with label attention to comprehend the language semantics with a decoder module in LATGeO, generating meaningful image captions. ...
arXiv:2109.07799v1
fatcat:3f7tef6oy5fcznv2hkearagygq
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
[article]
2020
arXiv
pre-print
We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation. ...
In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities. ...
[273] proposed using multi-view LSTM for modeling view-specific and cross-view interactions over time to generate robust latent codes for image captioning and multimodal behavior recognition by directly ...
arXiv:2010.09522v2
fatcat:l4npstkoqndhzn6hznr7eeys4u
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
[article]
2021
arXiv
pre-print
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. ...
... rotation, jigsaw, camouflage, grey-to-color, and blank-to-color for self-supervised VL pre-training at patches of different scales. ...
As such, these techniques will benefit general cross-modality representation learning [8]. VG = Visual Genome [35]. CC = Conceptual Captions [59]. SBU = SBU Captions [53]. ...
arXiv:2103.16110v3
fatcat:i3meybz6bzea7ebfdut4bkycse
A Survey on Temporal Sentence Grounding in Videos
[article]
2021
arXiv
pre-print
Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between the two modalities (i.e., text and video). ...
... ) to be used in TSGV, and iii) discusses in depth the potential problems of current benchmarking designs and research directions for further investigation. ...
It builds a joint graph for modelling the cross-/self-modal relations via iterative message passing, to capture the high-order interactions between two modalities effectively. ...
arXiv:2109.08039v2
fatcat:6ja4csssjzflhj426eggaf77tu
ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization
[article]
2022
arXiv
pre-print
In addition, missing awareness of cross-modal matching from many frameworks leads to performance reduction. ...
Integrating multimodal knowledge into the abstractive summarization task is a work-in-progress research area, with present techniques inheriting the fusion-then-generation paradigm. ...
The iterative alignment approach may gradually update cross-modal attention in order to amass clues for identifying matching semantics and improving cross-modal information interaction. ...
arXiv:2108.05123v2
fatcat:b24tgqqpnzh4lhimfvavu2n4n4
Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency
[article]
2020
arXiv
pre-print
Several measures are suggested to calculate cross-modal similarity for these entities using state-of-the-art approaches. ...
In this paper, we introduce a novel task of cross-modal consistency verification in real-world news and present a multimodal approach to quantify the entity coherence between image and text. ...
We also want to gratefully thank Avishek Anand (L3S Research Center, Leibniz Universität Hannover) for his valuable comments to improve the paper. ...
arXiv:2003.10421v2
fatcat:bqrn4xzc4zhovohco7wwx74xby
A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries
2015
Proceedings of the 23rd ACM international conference on Multimedia - MM '15
We present a probabilistic approach that seamlessly integrates visual and textual information for the task. ...
In particular, we focus on descriptive queries that can be either in the form of simple captions (e.g., "a brown cat sleeping on a sofa"), or even long descriptions with multiple sentences. ...
Both these methods have been shown to perform well for image retrieval using descriptive queries, and cross-modal retrieval in general. ...
doi:10.1145/2733373.2806289
dblp:conf/mm/VermaJ15
fatcat:pgrdsvtaqzb77b7fxuoiqxtlda
Showing results 1 — 15 out of 977 results