Fusion Models for Improved Visual Captioning
[article]
2020
arXiv
pre-print
Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. ...
fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder ...
Building on these developments, we propose to incorporate external language models into visual captioning frameworks to aid and improve their capabilities both for description generation and emendation ...
arXiv:2010.15251v2
fatcat:xs4qgzicyfdyzkotqi6bfndlu4
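As an illustration of the kind of fusion this entry describes (not the authors' exact architecture), the sketch below concatenates a caption decoder's hidden state with a hidden state from a pretrained auxiliary language model such as BERT before predicting the next word. The dimensions, module names, and the concatenation-based fusion itself are assumptions made for illustration only.

```python
# Illustrative sketch only: one way to fuse an auxiliary language model's hidden
# state with a caption decoder's hidden state before word prediction. Sizes and
# module names are placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn

class FusedCaptionHead(nn.Module):
    def __init__(self, dec_dim=512, auxlm_dim=768, vocab_size=10000):
        super().__init__()
        # Project the concatenated [decoder ; AuxLM] features back to dec_dim,
        # then map to the caption vocabulary.
        self.fuse = nn.Linear(dec_dim + auxlm_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, dec_hidden, auxlm_hidden):
        # dec_hidden:   (batch, dec_dim)   from the captioning decoder
        # auxlm_hidden: (batch, auxlm_dim) from the pretrained AuxLM (e.g. BERT)
        fused = torch.tanh(self.fuse(torch.cat([dec_hidden, auxlm_hidden], dim=-1)))
        return self.out(fused)  # logits over the caption vocabulary

head = FusedCaptionHead()
logits = head(torch.randn(2, 512), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 10000])
```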
Improving Visual Question Answering by Referring to Generated Paragraph Captions
2019
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model. ...
These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. ...
Acknowledgments We thank the reviewers for their helpful comments. ...
doi:10.18653/v1/p19-1351
dblp:conf/acl/KimB19
fatcat:47smsygpcrdghfzu2jgsaiefaa
Can images help recognize entities? A study of the role of images for Multimodal NER
[article]
2021
arXiv
pre-print
We also study the use of captions as a way to enrich the context for MNER. ...
Experiments on three datasets from popular social platforms expose the bottleneck of existing multimodal models and the situations where using captions is beneficial. ...
We would like to thank the members from the RiT-UAL lab at the University of Houston for their invaluable feedback. We also thank the anonymous W-NUT reviewers for their valuable suggestions. ...
arXiv:2010.12712v2
fatcat:2lyjfpgaurchdihzeudl5l6vyq
Dense Captioning with Joint Inference and Visual Context
[article]
2017
arXiv
pre-print
Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm. ...
We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. ...
Here, we see similar results as on V1.0, which further verifies the advantage of T-LSTM over S-LSTM (mAP 8.16 vs 6.44 for no-context), and that context fusion greatly improves performance for both models ...
arXiv:1611.06949v2
fatcat:gm2xmjbq65edhpxn5qrlpdnswy
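The "context fusion" idea mentioned in the snippets above can be pictured as combining each region's feature with a global image-context feature. The sketch below uses a learned gate for this; the gating form, dimensions, and module names are placeholder assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: gating a region feature with a global image-context
# feature, in the spirit of "context fusion" for dense captioning.
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_feat, context_feat):
        # region_feat, context_feat: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([region_feat, context_feat], dim=-1)))
        return g * region_feat + (1 - g) * context_feat  # gated blend of the two

fused = ContextFusion()(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```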
Leveraging Visual Question Answering for Image-Caption Ranking
[article]
2016
arXiv
pre-print
We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. ...
Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset. ...
Benefits of using such semantic mid-level visual representations include improving fine-grained visual recognition, learning models of visual concepts without example images (zero-shot learning [30, 39 ...
arXiv:1605.01379v2
fatcat:eugr5hoo4rgptca7jdm75vd2jm
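The entry above distinguishes score-level from representation-level fusion of VQA knowledge into an image-caption ranking model. The sketch below contrasts the two in the simplest possible form; the mixing weight, dimensions, and module names are illustrative assumptions, not the paper's values.

```python
# Illustrative sketch only: score-level fusion combines two ranking scores,
# while representation-level fusion combines embeddings before scoring.
import torch
import torch.nn as nn

def score_level_fusion(base_score, vqa_score, alpha=0.7):
    # Weighted sum of the VQA-agnostic score and the VQA-based score.
    return alpha * base_score + (1 - alpha) * vqa_score

class RepresentationLevelFusion(nn.Module):
    def __init__(self, base_dim=1024, vqa_dim=512, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(base_dim + vqa_dim, out_dim)

    def forward(self, base_emb, vqa_emb):
        # Concatenate the two embeddings; a downstream head would score the result.
        return self.proj(torch.cat([base_emb, vqa_emb], dim=-1))

print(score_level_fusion(torch.tensor([0.8]), torch.tensor([0.6])))
print(RepresentationLevelFusion()(torch.randn(1, 1024), torch.randn(1, 512)).shape)
```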
Dense Captioning with Joint Inference and Visual Context
2017
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome [23] for dense captioning with a relative gain of 73% compared to the previous best algorithm. ...
We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. ...
(a) For a region proposal, the bounding box can adapt and improve with the caption word by word. ...
doi:10.1109/cvpr.2017.214
dblp:conf/cvpr/YangTYL17
fatcat:7mcgtr3oinag5pnwob5bqviglq
Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning
2021
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The experimental results demonstrate that our method achieves significant improvements as compared to all baselines. ...
In the experiments, we use a large number of unpaired images and sentences to train our model on the unsupervised and unpaired setting. ...
The con-to-sen model performs worse than our TSGAN model, which indicates that introducing visual information into the captioning model can effectively improve its performance ...
doi:10.1109/icassp39728.2021.9414335
fatcat:76phn3ilwnhmtasabefmtxxzpy
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
2016
arXiv
pre-print
Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. ...
We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality ...
Gulcehre et al. (2015) developed an LSTM model for machine translation that incorporates a monolingual language model for the target language, showing improved results. ...
arXiv:1604.01729v2
fatcat:gr6mmqbvkbfz7omrmthaqfbjn4
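A common way to incorporate an external language model into a caption decoder, in the spirit of the Gulcehre et al. (2015) fusion cited in the snippet above, is to mix the decoder's and the language model's next-word distributions. The sketch below shows a simple log-linear mix; the mixing weight is a placeholder, not a value from the paper.

```python
# Illustrative sketch only: "late" fusion of a caption decoder's next-word
# distribution with an external language model's distribution.
import torch

def late_fusion_log_probs(decoder_logits, lm_logits, beta=0.3):
    # decoder_logits, lm_logits: (batch, vocab) unnormalized scores
    dec_logp = torch.log_softmax(decoder_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    return (1 - beta) * dec_logp + beta * lm_logp  # combined next-word log-scores

combined = late_fusion_log_probs(torch.randn(2, 10000), torch.randn(2, 10000))
print(combined.shape)  # torch.Size([2, 10000])
```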
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
[article]
2019
arXiv
pre-print
Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. ...
Acknowledgments: This work was funded in part by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC Canada Research Chair (CRC) and an NSERC Discovery and Discovery Accelerator Supplement Grants ...
arXiv:1909.09944v2
fatcat:paj6fq6mwvdmfk73ubfz55jg4a
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
2016
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. ...
We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality ...
Gulcehre et al. (2015) developed an LSTM model for machine translation that incorporates a monolingual language model for the target language, showing improved results. ...
doi:10.18653/v1/d16-1204
dblp:conf/emnlp/VenugopalanHMS16
fatcat:rkxhj2pojvdcpdovdwtfi2qqzm
Vision and Language Integration Meets Multimedia Fusion
2018
IEEE Multimedia
Recent advances in deep learning have opened up new opportunities in joint modeling of visual and co-occurring verbal information in multimedia. ...
Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies including rule-based approaches, information-theoretic models, and machine learning ...
ACKNOWLEDGMENT We thank all the reviewers who contributed to the selection of articles for this special issue. ...
doi:10.1109/mmul.2018.023121160
fatcat:g2zcgejxpfg75jawrdjqofdaqq
CIGLI: Conditional Image Generation from Language & Image
[article]
2021
arXiv
pre-print
We then propose a novel language-image fusion model which improves the performance over two established baseline methods, as evaluated by quantitative (automatic) and qualitative (human) evaluations. ...
We improve the model through fine-tuning and fusion of image features. The two model architectures are reflected in Figure 2. ...
We also propose a new image-text fusion model based on DF-GAN, which improves the performance compared with two baseline models. ...
arXiv:2108.08955v1
fatcat:ayxe4gf24zeyfhvnwrdpv5z7fy
VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking
2017
TREC Video Retrieval Evaluation
In this study, we intend to find whether combining the concept-based system, the captioning system, and the text-based search system helps improve search performance. ...
- No-spatial-temporal attention model: similarity scores from the above three models are fused by averaging for the final ranking. ...
For the combination of concepts and captioning, the manual process is to manually assign fusion weights to three different components: concepts, captioning-ResNet152, captioning-C3D. ...
dblp:conf/trecvid/NguyenLCL00N17
fatcat:jjfp3n7qunfg5mvnrmdmf4m5dq
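The snippets above describe manually assigning fusion weights to three components (concepts, captioning-ResNet152, captioning-C3D) and averaging similarity scores for the final ranking. A minimal sketch of such weighted score fusion follows; the weights and score values are arbitrary placeholders, not the team's actual settings.

```python
# Illustrative sketch only: manually weighted fusion of similarity scores from
# three retrieval components, followed by a best-first ranking.
import numpy as np

def fuse_scores(concept, cap_resnet, cap_c3d, weights=(0.4, 0.3, 0.3)):
    # Each argument: array of similarity scores over the candidate videos.
    w1, w2, w3 = weights
    return w1 * concept + w2 * cap_resnet + w3 * cap_c3d

scores = fuse_scores(np.array([0.2, 0.9]), np.array([0.5, 0.4]), np.array([0.1, 0.7]))
ranking = np.argsort(-scores)  # indices of candidates, best first
print(scores, ranking)
```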
A Frustratingly Simple Approach for End-to-End Image Captioning
[article]
2022
arXiv
pre-print
As a result, we do not need extra object detectors for model training. ...
In addition, errors from the object detectors easily propagate to the following captioning models, degrading their performance. ...
[16] indicate that using a larger learning rate for the randomly initialized cross-modal fusion module during pre-training can improve model performance on the downstream tasks. ...
arXiv:2201.12723v3
fatcat:ix4bz4aigzc2pd3uamxjswjgia
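The last snippet above mentions using a larger learning rate for a randomly initialized cross-modal fusion module than for pretrained weights. The sketch below shows how such a scheme is commonly expressed with optimizer parameter groups; the module shapes and learning-rate values are placeholder assumptions, not the cited paper's configuration.

```python
# Illustrative sketch only: give a randomly initialized cross-modal fusion
# module a larger learning rate than the pretrained backbone.
import torch
import torch.nn as nn

backbone = nn.Linear(768, 768)      # stands in for a pretrained encoder
fusion = nn.Linear(768 * 2, 768)    # stands in for a new cross-modal fusion module

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},  # small LR: pretrained weights
    {"params": fusion.parameters(), "lr": 5e-5},    # larger LR: freshly initialized
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 5e-05]
```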
Cascade Semantic Fusion for Image Captioning
2019
IEEE Access
The empirical analysis shows that the CSF can assist image captioning model in selecting the object regions of interest. ...
INDEX TERMS Attention mechanism, feature fusion, image captioning. ...
Li and Chen [19] devised a visual-semantic LSTM model to explore the inner connections of visual and semantic features for image captioning via two LSTMs. ...
doi:10.1109/access.2019.2917979
fatcat:rsrjp63whfgbnh2cbua6piqrvq
Showing results 1 — 15 out of 8,381 results