348 Hits in 7.1 sec

Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models [article]

Steven Y. Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, Varun Gangal
2022 arXiv   pre-print
We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation.  ...  We call our approach VisCTG: Visually Grounded Concept-to-Text Generation.  ...  In this paper, we show this is true by improving the performance of Transformer-based text generation models on concept-to-text generation using visual grounding, which we call VisCTG: Visually Grounded Concept-to-Text Generation.  ...
arXiv:2109.03892v3 fatcat:5gan3ol67vbffhv3fqr76j4uom
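
As a rough, hypothetical illustration of the retrieve-caption-generate idea summarized in this abstract (not the authors' released code), the Python sketch below shows how captions of retrieved images might be prepended to a concept set before it is passed to a text generator; retrieve_images, caption_image, and the toy index are all placeholder assumptions.

# Hypothetical sketch of a retrieve-caption-generate pipeline (VisCTG-style).
# retrieve_images() and caption_image() are placeholder stubs, not the paper's code.

def retrieve_images(concepts, index):
    """Return images whose metadata matches the query concepts (stub)."""
    query = " ".join(concepts)
    return index.get(query, [])

def caption_image(image):
    """Return a caption for the image (stub for any captioning model)."""
    return image["caption"]

def build_augmented_input(concepts, index):
    # 1. Retrieve images relevant to the concept set.
    images = retrieve_images(concepts, index)
    # 2. Caption each retrieved image.
    captions = [caption_image(img) for img in images]
    # 3. Concatenate the captions with the concepts as the generator's input.
    return " ".join(captions) + " <sep> " + " ".join(concepts)

# Toy usage: a fake one-entry "index" standing in for a real retrieval system.
toy_index = {"dog frisbee catch": [{"caption": "a dog leaps to catch a frisbee"}]}
print(build_augmented_input(["dog", "frisbee", "catch"], toy_index))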

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning [article]

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
2020 arXiv   pre-print
Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.  ...  We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes.  ...  In this paper we propose the Video to Commonsense (V2C) framework to generate visually grounded commonsense descriptions about the underlying event in the video, enriching the factual description provided  ... 
arXiv:2003.05162v3 fatcat:xgri7zaajjejhmujw5crlxmnti

KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation [article]

Yiran Xing, Zai Shi, Zhao Meng, Gerhard Lakemeyer, Yunpu Ma, Roger Wattenhofer
2021 arXiv   pre-print
We further develop novel pretraining tasks to improve the model performance on the Visual Commonsense Generation (VCG) task.  ...  In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts model performance on the VCG task by leveraging commonsense knowledge from a large language model pretrained on  ...  To ease this problem, researchers have proposed various models (Zhou et al., 2020) for generating texts based on visual inputs.  ...
arXiv:2101.00419v2 fatcat:txtz26swevdyhijs4qjjbwergm
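
Purely as a hedged sketch of what a Knowledge-based Commonsense Generation (KCG) training pair might look like, the snippet below pairs event descriptions with relation-tagged inferences drawn from a stubbed commonsense model; commonsense_lm, the relation names, and the data are illustrative assumptions, not KM-BART's actual pipeline.

# Hedged sketch of building KCG-style training pairs: a commonsense language
# model (stubbed here, standing in for a COMET-like model) supplies inference
# targets for event descriptions.

RELATIONS = ["xIntent", "xEffect", "xAttr"]  # intent/effect/attribute-style relations

def commonsense_lm(event, relation):
    """Stub standing in for a pretrained commonsense model."""
    canned = {
        ("PersonX buys a ticket", "xIntent"): "to see the movie",
    }
    return canned.get((event, relation), "none")

def make_kcg_examples(events):
    examples = []
    for event in events:
        for rel in RELATIONS:
            target = commonsense_lm(event, rel)
            if target != "none":
                # Source pairs the event with a relation tag; target is the inference.
                examples.append({"src": f"{event} <{rel}>", "tgt": target})
    return examples

print(make_kcg_examples(["PersonX buys a ticket"]))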

Contextualized Scene Imagination for Generative Commonsense Reasoning [article]

PeiFeng Wang, Jonathan Zamora, Junfeng Liu, Filip Ilievski, Muhao Chen, Xiang Ren
2022 arXiv   pre-print
However, such generative commonsense reasoning (GCSR) skills are lacking in state-of-the-art text generation methods.  ...  Descriptive sentences about arbitrary concepts generated by neural text generation models (e.g., pre-trained text-to-text Transformers) are often grammatically fluent but may not correspond to human commonsense  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers and all the collaborators in the USC INK research lab for their valuable feedback.  ...
arXiv:2112.06318v3 fatcat:2c2dazlqpzdvjp274gtboofgsu

VL-BERT: Pre-training of Generic Visual-Linguistic Representations [article]

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
2020 arXiv   pre-print
To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus.  ...  We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).  ...  A common ground for studying the feature design and pre-training of visual-linguistic tasks in general has been lacking.  ...
arXiv:1908.08530v4 fatcat:venc4egmz5hhbe4oeyt5f2wgku
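
The abstract describes a single generic representation for visual-linguistic inputs; the numpy sketch below is one assumed reading of such a single-stream input, concatenating word embeddings with projected image-region features into one token sequence with segment ids. The dimensions, projection, and embedding stubs are toy assumptions, not VL-BERT's implementation.

import numpy as np

# Minimal sketch (not VL-BERT's code) of a single-stream visual-linguistic
# input: word embeddings and image-region features are concatenated into one
# token sequence, tagged with segment ids, and would then be fed to a shared
# Transformer encoder.

HIDDEN = 8  # toy hidden size

def embed_words(tokens):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), HIDDEN))  # stand-in word embeddings

def embed_regions(region_feats, proj):
    return region_feats @ proj  # project visual features into the text space

tokens = ["[CLS]", "a", "dog", "[SEP]"]
regions = np.ones((2, 4))                 # 2 fake region feature vectors (dim 4)
proj = np.full((4, HIDDEN), 0.1)          # toy visual-to-hidden projection

seq = np.concatenate([embed_words(tokens), embed_regions(regions, proj)], axis=0)
segment_ids = [0] * len(tokens) + [1] * len(regions)  # text vs. visual segments
print(seq.shape, segment_ids)  # (6, 8) [0, 0, 0, 0, 1, 1]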

Visual Distant Supervision for Scene Graph Generation [article]

Yuan Yao, Ao Zhang, Xu Han, Mengdi Li, Cornelius Weber, Zhiyuan Liu, Stefan Wermter, Maosong Sun
2021 arXiv   pre-print
for predicate classification in the Visual Genome evaluation.  ...  The intuition is that by aligning commonsense knowledge bases and images, we can automatically create large-scale labeled data to provide distant supervision for visual relation learning.  ...  This work is jointly funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 62061136001 / DFG TRR-169.  ...
arXiv:2103.15365v2 fatcat:jjgqhd43qzerhcsq4z4i5gzmdm
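
To make the alignment idea concrete, here is a minimal sketch, assuming a toy knowledge base: every ordered pair of detected objects is looked up in the KB, and any matching predicates become noisy, distantly supervised relation labels. The KB contents and detection format are invented for illustration, not taken from the paper.

# Hedged sketch of visual distant supervision: object pairs detected in an
# image are aligned against a (toy) commonsense knowledge base to produce
# noisy relation labels without human annotation.

KB = {
    ("person", "horse"): ["ride", "feed"],
    ("cup", "table"): ["on"],
}

def distant_label(detections):
    """Yield (subject, predicate, object) triples for every detected pair."""
    labels = []
    for subj in detections:
        for obj in detections:
            if subj == obj:
                continue
            for pred in KB.get((subj, obj), []):
                labels.append((subj, pred, obj))
    return labels

print(distant_label(["person", "horse"]))
# [('person', 'ride', 'horse'), ('person', 'feed', 'horse')]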

CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning [article]

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, Xiang Ren
2020 arXiv   pre-print
In this paper, we present a constrained text generation task, CommonGen, associated with a benchmark dataset, to explicitly test machines for the ability to perform generative commonsense reasoning.  ...  Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance.  ...  More specifically, we collect visually grounded sentences from several existing caption datasets, including image captioning datasets such as Flickr30k (Young et al., 2014), MSCOCO (Lin et al., 2014)  ...
arXiv:1911.03705v4 fatcat:iuxuuphxdzexjag2hfi7p622wq
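
A minimal sketch of the task format described above: given a concept set, a model must generate one coherent sentence that uses every concept. The naive coverage check below is a simplified stand-in for the benchmark's real metrics and deliberately ignores inflectional variants.

# Illustrative sketch of the CommonGen task format: concept set in, one
# plausible sentence covering all concepts out.

def concept_coverage(concepts, sentence):
    """Fraction of required concepts that appear verbatim in the sentence."""
    words = set(sentence.lower().split())
    covered = [c for c in concepts if c.lower() in words]
    return len(covered) / len(concepts)

concepts = ["dog", "frisbee", "catch", "throw"]
generation = "A man throws a frisbee and his dog runs to catch it."
# Note: real evaluation also handles inflections ("throws" vs. "throw");
# this naive exact-match check does not, so "throw" counts as missing.
print(concept_coverage(concepts, generation))  # 0.75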

How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval [article]

Rodrigo Toro Icarte, Jorge A. Baier, Cristian Ruz, Alvaro Soto
2017 arXiv   pre-print
Current state-of-the-art approaches for visual recognition do not exploit these rule-based knowledge sources. Instead, they learn recognition models directly from training examples.  ...  Consequently, a main conclusion of this work is that general-purpose commonsense ontologies improve performance on visual reasoning tasks when properly filtered to select meaningful visual relations.  ...  Conclusions and Perspectives: This paper presented an approach to enhancing a learning-based technique for sentence-based image retrieval with general-purpose knowledge provided by ConceptNet, a large commonsense  ...
arXiv:1705.08844v1 fatcat:q7ilmnh7vnhqtiflv43p4dhide

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation [article]

Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios P. Spithourakis, Lucy Vanderwende
2017 arXiv   pre-print
Experiments with models trained on social media data show that the combination of visual and textual context enhances the quality of generated conversational turns.  ...  We present a novel task, Image-Grounded Conversations (IGC), in which natural-sounding conversations are generated about a shared image.  ...  Retrieval Models In addition to generation, we implemented two retrieval models customized for the tasks of question and response generation.  ... 
arXiv:1701.08251v2 fatcat:psa5hxyiefaszlnkjbgx2mjxey

How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval

Rodrigo Toro Icarte, Jorge A. Baier, Cristian Ruz, Alvaro Soto
2017 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence  
Current state-of-the-art approaches for visual recognition do not exploit these rule-based knowledge sources. Instead, they learn recognition models directly from training examples.  ...  Consequently, a main conclusion of this work is that general-purpose commonsense ontologies improve performance on visual reasoning tasks when properly filtered to select meaningful visual relations.  ...  This is illustrated by the experimental data that showed that information in the ontology alone did not improve performance, while the combination of an ontology and crowd-sourced visual knowledge (from  ... 
doi:10.24963/ijcai.2017/178 dblp:conf/ijcai/IcarteBRS17 fatcat:dgvlaqbv6fa6xd2exfe3247yqe

C3VQG: Category Consistent Cyclic Visual Question Generation [article]

Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, Rajiv Ratn Shah
2021 arXiv   pre-print
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.  ...  Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features  ...  ACKNOWLEDGEMENTS Rajiv Ratn Shah is partly supported by the Infosys Center for AI at IIIT Delhi.  ...
arXiv:2005.07771v5 fatcat:numf777pq5dsdbbsvjapx44gya

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang
2020 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer.  ...  Both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for the cross-modal pre-training, where three pre-training tasks are employed, including Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-Linguistic Matching (VLM).  ...  Acknowledgments We thank the anonymous reviewers for their helpful comments and discussions.  ...
doi:10.1609/aaai.v34i07.6795 fatcat:6phdeb33qfdonawtnkbbkaya54
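
The three pre-training tasks named in this snippet can be pictured with a short sketch; the code below improvises two of them: BERT-style token masking for MLM, and matched/mismatched pair construction for VLM (MOC would apply the same masking pattern to region features and predict object classes). The 15% masking rate mirrors BERT's convention, and all data and rates here are assumptions, not the paper's recipe.

import random

# Hedged sketch of Unicoder-VL-style pre-training objectives (assumed recipe).

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return tokens + targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)       # MLM target: recover the original token
        else:
            masked.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return masked, targets

def make_vlm_pair(caption, image_id, all_images, match_prob=0.5, seed=0):
    """Build a Visual-Linguistic Matching example: aligned or random negative."""
    rng = random.Random(seed)
    if rng.random() < match_prob:
        return caption, image_id, 1              # aligned caption-image pair
    return caption, rng.choice(all_images), 0    # mismatched negative pair

print(mask_tokens(["a", "dog", "catches", "a", "frisbee"], mask_prob=0.3))
print(make_vlm_pair("a dog catches a frisbee", "img_1", ["img_1", "img_2"]))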

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training [article]

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
2019 arXiv   pre-print
After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer.  ...  Borrowing ideas from cross-lingual pre-trained models such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for the cross-modal pre-training, where three pre-training tasks are employed  ...  Acknowledgments We thank the anonymous reviewers for their helpful comments and discussions.  ...
arXiv:1908.06066v3 fatcat:2wzkimv43fbg5lapnc3cbbtksy

Conditional Text Generation for Harmonious Human-Machine Interaction [article]

Bin Guo, Hao Wang, Yasan Ding, Wei Wu, Shaoyang Hao, Yueqi Sun, Zhiwen Yu
2020 arXiv   pre-print
We first summarize several key techniques and illustrate the technical evolution route in the field of neural text generation, based on the concept model of CTG.  ...  In recent years, with the development of deep learning, text generation technology has undergone great changes and provided many kinds of services for human beings, such as restaurant reservation and daily  ...  In this paper, we first make a brief summary of the development history of text generation technology and then give the formal definitions of different types of c-TextGen.  ...
arXiv:1909.03409v2 fatcat:s2zfmwxtubgwjks4luoby6vdoq

VisualCOMET: Reasoning about the Dynamic Context of a Still Image [article]

Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, Yejin Choi
2020 arXiv   pre-print
In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text.  ...  N66001-19-2-4031), and gifts from Allen Institute for Artificial Intelligence.  ...
arXiv:2004.10796v3 fatcat:xodcxsclmzgp7m6vooygu4e66a
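
As an assumed (not official) record shape, the dictionary below illustrates what person-grounding could look like in practice: each PersonN tag in the event and inference texts resolves to a detected person box in the image. Field names, box format, and contents are hypothetical.

# Illustrative VisualCOMET-style record (an assumption about shape, not the
# released schema): person tags in text are tied to image regions.

record = {
    "image": "movie_frame_0421.jpg",             # hypothetical file name
    "people": {                                  # detected person boxes
        "Person1": [34, 50, 120, 260],           # [x1, y1, x2, y2]
        "Person2": [200, 40, 310, 255],
    },
    "event": "Person1 hands Person2 an umbrella",
    "inferences": {
        "before": ["Person1 noticed it was raining"],
        "intent": ["Person1 wants to keep Person2 dry"],
        "after": ["Person2 opens the umbrella"],
    },
}

# Each "PersonN" mention in the text resolves to a box in record["people"],
# giving the tighter image-text integration the abstract describes.
for tag, box in record["people"].items():
    print(tag, "->", box)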
Showing results 1 — 15 out of 348 results