8,381 Hits in 4.0 sec

Fusion Models for Improved Visual Captioning [article]

Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow
2020 arXiv   pre-print
Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz.  ...  fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder  ...  Building on these developments, we propose to incorporate external language models into visual captioning frameworks to aid and improve their capabilities both for description generation and emendation  ... 
arXiv:2010.15251v2 fatcat:xs4qgzicyfdyzkotqi6bfndlu4
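The entry above describes integrating a pretrained auxiliary language model (AuxLM) with an encoder-decoder captioner via fusion strategies. A minimal sketch of one generic strategy, score-level ("late") fusion at each decoding step, is shown below; the function name, tensor names, and the 0.7 mixing weight are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch (assumed, not from the paper): late fusion of a visual captioning
# decoder with an auxiliary text-only language model over a shared vocabulary.
import torch
import torch.nn.functional as F

def late_fusion_step(caption_logits: torch.Tensor,
                     auxlm_logits: torch.Tensor,
                     alpha: float = 0.7) -> torch.Tensor:
    """Mix the two next-token distributions with weight `alpha`."""
    p_caption = F.softmax(caption_logits, dim=-1)   # visual captioner
    p_auxlm = F.softmax(auxlm_logits, dim=-1)       # auxiliary language model
    p_fused = alpha * p_caption + (1.0 - alpha) * p_auxlm
    return torch.log(p_fused)                       # log-probs for beam search

# Usage: next_token = late_fusion_step(dec_logits, lm_logits).argmax(dim=-1)
```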

Improving Visual Question Answering by Referring to Generated Paragraph Captions

Hyounghun Kim, Mohit Bansal
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.  ...  These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering.  ...
doi:10.18653/v1/p19-1351 dblp:conf/acl/KimB19 fatcat:47smsygpcrdghfzu2jgsaiefaa

Can images help recognize entities? A study of the role of images for Multimodal NER [article]

Shuguang Chen, Gustavo Aguilar, Leonardo Neves, Thamar Solorio
2021 arXiv   pre-print
We also study the use of captions as a way to enrich the context for MNER.  ...  Experiments on three datasets from popular social platforms expose the bottleneck of existing multimodal models and the situations where using captions is beneficial.  ...
arXiv:2010.12712v2 fatcat:2lyjfpgaurchdihzeudl5l6vyq

Dense Captioning with Joint Inference and Visual Context [article]

Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li
2017 arXiv   pre-print
Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.  ...  We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges.  ...  Here, we see similar results to those on V1.0, which further verifies the advantage of T-LSTM over S-LSTM (mAP 8.16 vs 6.44 for no-context), and that context fusion greatly improves performance for both models  ...
arXiv:1611.06949v2 fatcat:gm2xmjbq65edhpxn5qrlpdnswy

Leveraging Visual Question Answering for Image-Caption Ranking [article]

Xiao Lin, Devi Parikh
2016 arXiv   pre-print
We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model.  ...  Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.  ...  Benefits of using such semantic mid-level visual representations include improving fine-grained visual recognition, learning models of visual concepts without example images (zero-shot learning [30, 39  ... 
arXiv:1605.01379v2 fatcat:eugr5hoo4rgptca7jdm75vd2jm
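The entry above proposes score-level and representation-level fusion to inject VQA knowledge into a VQA-agnostic image-caption ranking model. Below is a minimal sketch of the score-level idea only, under assumptions: the base matching score is blended with an average VQA-consistency score over probe question-answer pairs; all names and the weight `beta` are hypothetical.

```python
# Sketch (assumed, not the paper's exact formulation): score-level fusion of a
# VQA-agnostic ranking score with VQA-derived evidence for an image-caption pair.
from typing import Callable, List, Tuple

def fused_ranking_score(match_score: float,
                        qa_pairs: List[Tuple[str, str]],
                        vqa_consistency: Callable[[str, str], float],
                        beta: float = 0.3) -> float:
    """Blend the base matching score with an averaged VQA-consistency score."""
    if not qa_pairs:
        return match_score
    vqa_score = sum(vqa_consistency(q, a) for q, a in qa_pairs) / len(qa_pairs)
    return (1.0 - beta) * match_score + beta * vqa_score
```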

Dense Captioning with Joint Inference and Visual Context

Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome [23] for dense captioning with a relative gain of 73% compared to the previous best algorithm.  ...  We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges.  ...  (a) For a region proposal, the bounding box can adapt and improve with the caption word by word.  ... 
doi:10.1109/cvpr.2017.214 dblp:conf/cvpr/YangTYL17 fatcat:7mcgtr3oinag5pnwob5bqviglq

Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning

Yucheng Zhou, Wei Tao, Wenqiang Zhang
2021 ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
The experimental results demonstrate that our method achieves significant improvements compared to all baselines.  ...  In the experiments, we use a large number of unpaired images and sentences to train our model in the unsupervised, unpaired setting.  ...  The con-to-sen model performs worse than our TSGAN model, which indicates that introducing visual information into the captioning model can effectively improve its performance  ...
doi:10.1109/icassp39728.2021.9414335 fatcat:76phn3ilwnhmtasabefmtxxzpy

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, Kate Saenko
2016 arXiv   pre-print
Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description.  ...  We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality  ...  Gulcehre et al. (2015) developed an LSTM model for machine translation that incorporates a monolingual language model for the target language, showing improved results.  ...
arXiv:1604.01729v2 fatcat:gr6mmqbvkbfz7omrmthaqfbjn4

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning [article]

Tanzila Rahman, Bicheng Xu, Leonid Sigal
2019 arXiv   pre-print
Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival performance of a state-of-the-art visual model and, combined  ...  with video, can improve on the state-of-the-art performance.  ...
arXiv:1909.09944v2 fatcat:paj6fq6mwvdmfk73ubfz55jg4a

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, Kate Saenko
2016 Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing  
Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description.  ...  We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality  ...  Gulcehre et al. (2015) developed an LSTM model for machine translation that incorporates a monolingual language model for the target language, showing improved results.  ...
doi:10.18653/v1/d16-1204 dblp:conf/emnlp/VenugopalanHMS16 fatcat:rkxhj2pojvdcpdovdwtfi2qqzm

Vision and Language Integration Meets Multimedia Fusion

Marie-Francine Moens, Katerina Pastra, Kate Saenko, Tinne Tuytelaars
2018 IEEE Multimedia  
Recent advances in deep learning have opened up new opportunities in joint modeling of visual and co-occurring verbal information in multimedia.  ...  Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies including rule-based approaches, information-theoretic models, and machine learning  ...
doi:10.1109/mmul.2018.023121160 fatcat:g2zcgejxpfg75jawrdjqofdaqq

CIGLI: Conditional Image Generation from Language Image [article]

Xiaopeng Lu, Lynnette Ng, Jared Fernandez, Hao Zhu
2021 arXiv   pre-print
We then propose a novel language-image fusion model which improves the performance over two established baseline methods, as evaluated by quantitative (automatic) and qualitative (human) evaluations.  ...  We improve the model through fine-tuning and fusion of image features. The two model architectures are reflected in Figure 2.  ...  We also propose a new image-text fusion model based on DF-GAN, which improves the performance compared with two baseline models.  ...
arXiv:2108.08955v1 fatcat:ayxe4gf24zeyfhvnwrdpv5z7fy

VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking

Phuong Anh Nguyen, Qing Li, Zhi-Qi Cheng, Yi-Jie Lu, Hao Zhang, Xiao Wu, Chong-Wah Ngo
2017 TREC Video Retrieval Evaluation  
In this study, we intend to find whether the combination of the concept-based system, captioning system, and text-based search system would help to improve search performance.  ...  No-spatial-temporal attention model: similarity scores from the above three models are fused by averaging for the final ranking.  ...  For the combination of concepts and captioning, fusion weights are manually assigned to three different components: concepts, captioning-ResNet152, and captioning-C3D.  ...
dblp:conf/trecvid/NguyenLCL00N17 fatcat:jjfp3n7qunfg5mvnrmdmf4m5dq

A Frustratingly Simple Approach for End-to-End Image Captioning [article]

Ziyang Luo, Yadong Xi, Rongsheng Zhang, Jing Ma
2022 arXiv   pre-print
As a result, we do not need extra object detectors for model training.  ...  In addition, errors from the object detectors easily propagate to the downstream captioning models, degrading their performance.  ...  [16] indicate that using a larger learning rate for the randomly initialized cross-modal fusion module during pre-training can improve model performance on the downstream tasks.  ...
arXiv:2201.12723v3 fatcat:ix4bz4aigzc2pd3uamxjswjgia

Cascade Semantic Fusion for Image Captioning

Shiwei Wang, Long Lan, Xiang Zhang, Guohua Dong, Zhigang Luo
2019 IEEE Access  
The empirical analysis shows that the CSF can assist the image captioning model in selecting the object regions of interest.  ...  INDEX TERMS Attention mechanism, feature fusion, image captioning.  ...  Li and Chen [19] devised a visual-semantic LSTM model to explore the inner connections of visual and semantic features for image captioning via two LSTMs.  ...
doi:10.1109/access.2019.2917979 fatcat:rsrjp63whfgbnh2cbua6piqrvq
Showing results 1 — 15 out of 8,381 results