Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
2018
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Second, different events ending at (nearly) the same time are indistinguishable in the previous works, resulting in the same captions. ...
We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. ...
The attended visual feature is generated by a weighted sum: $v_t = \sum_{i=1}^{p} \alpha_i^t \cdot v_{i+m-1}$ (Eq. 9). We expect the model to better locate "key frames" and produce more semantically correlated words by involving ...
doi:10.1109/cvpr.2018.00751
dblp:conf/cvpr/WangJ00X18
fatcat:l3mc7jrzhna4djoe6okz4x7vdy
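The weighted sum in Eq. (9) above is simply a softmax-weighted average of the frame features inside a proposal window. The following is a minimal NumPy sketch of that operation, not the authors' implementation; the feature shapes, the scoring inputs, and the helper name attended_feature are illustrative assumptions.

    import numpy as np

    def attended_feature(frame_feats, scores):
        # frame_feats: (p, d) array of visual features v_{m}, ..., v_{m+p-1}
        # scores: (p,) unnormalized relevance scores for decoding step t
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()          # softmax -> attention weights alpha_i^t
        return alpha @ frame_feats           # weighted sum -> attended feature v_t, shape (d,)

    # Toy example: 5 frame features of dimension 8
    rng = np.random.default_rng(0)
    v = rng.standard_normal((5, 8))
    s = rng.standard_normal(5)
    print(attended_feature(v, s).shape)      # (8,)

Because the weights sum to one, v_t stays in the convex hull of the frame features, which is what lets the decoder softly pick out "key frames" at each word-generation step.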
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
[article]
2018
arXiv
pre-print
Second, different events ending at (nearly) the same time are indistinguishable in the previous works, resulting in the same captions. ...
We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. ...
Captioning Performance on Different Activity Categories In Fig. 8 , we provide detailed dense captioning performance for videos from different activity categories. ...
arXiv:1804.00100v2
fatcat:snnkr2e2fzcqxjghqaafqnsdzq
Semi Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework
[article]
2019
arXiv
pre-print
The model's retrieval and localization performance is evaluated on MSCOCO and Flickr30K Entities datasets. ...
To accomplish this task, our architecture makes use of the rich semantic information available in a joint embedding space of multi-modal data. ...
We hypothesize that by using such a proxy task for our neural network framework, we can implicitly capture visual associations of caption tokens and objects located at different spatial locations in the ...
arXiv:1908.02950v1
fatcat:areuscjwnba4nf26lhztfsv2im
Natural Language Description of Videos for Smart Surveillance
2021
Applied Sciences
After the September 11 attacks, security and surveillance measures have changed across the globe. Now, surveillance cameras are installed almost everywhere to monitor video footage. ...
This framework is based on the multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks ...
CCTV (Closed-Circuit Television) security cameras were installed in these locations. Each location consisted of 300 videos. The span of each video lay between 7 and 10 s. ...
doi:10.3390/app11093730
fatcat:62o2odoqn5gajk46grynsxaa74
UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations
[article]
2019
arXiv
pre-print
We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. ...
The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes. ...
Ablation study: semantic components. We now delve into the effectiveness of different semantic components by choosing different combinations of components for the caption embedding. ...
arXiv:1904.05521v2
fatcat:kfsldaebbvawbf7xwd7yuy7izu
Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding
[article]
2022
arXiv
pre-print
These methods do not reason about the target moment locations based on visual-textual semantic alignment but over-rely on the temporal biases of queries in the training sets. ...
Extensive experiments on Charades-STA and ActivityNet Captions demonstrate the effectiveness of our method for mitigating the reliance on temporal biases and strengthening the model's generalization ability ...
The cross-modal matching task requires that the model predicts as consistent frame-level cross-modal relevance as possible for target moments, even if their temporal positions change. ...
arXiv:2207.14698v1
fatcat:2srhrtljovby7ijznxgbvjpvwu
Text2Scene: Generating Compositional Scenes From Textual Descriptions
2019
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc) at every time step by attending to different parts of the input text and the current status ...
images, and synthetic images. ...
Acknowledgements: This work was partially supported by an IBM Faculty Award to V.O, and gift funding from SAP Research. ...
doi:10.1109/cvpr.2019.00687
dblp:conf/cvpr/TanFO19
fatcat:yzi744fv55hnrgmvvtybyu7mbe
Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding
2021
IEEE Access
Moreover, this task needs to detect location clues precisely in both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence is still ...
Firstly, we present a coarse-grained crucial frame selection module, where the query-guided local difference context modeling from adjacent frames helps discriminate all the coarse boundary locations relevant ...
The difference operation benefits recognizing the sentence-relevant interactions or scene changes. ...
doi:10.1109/access.2021.3095229
fatcat:jow3mzxavfaohemaxgk4d2buey
Generating Images from Captions with Attention
[article]
2016
arXiv
pre-print
We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset. ...
After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. ...
., 2015) for open sourcing their code, and Ryan Kiros and Nitish Srivastava for helpful discussions. ...
arXiv:1511.02793v2
fatcat:vrpchw52wbcc5mv34iqwqg7i7m
Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case
[article]
2021
arXiv
pre-print
Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed ...
To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe ...
Acknowledgements This work has been funded by Eureka and the Swedish Governmental Agency for Innovation Systems through the Eurostars program, grant agreement no. 11776. ...
arXiv:2102.11115v1
fatcat:f6l5gzk7hjdgjayvzgnw7ux4ha
Text2Scene: Generating Compositional Scenes from Textual Descriptions
[article]
2019
arXiv
pre-print
Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc) at every time step by attending to different parts of the input text and the current status ...
images, and synthetic images. ...
Model Text2Scene adopts a sequence-to-sequence approach [31] and introduces key designs for spatial and sequential reasoning. ...
arXiv:1809.01110v3
fatcat:xhpqjfczx5htdg5sfbck2cw3em
Chinese Event Extraction Based on Attention and Semantic Features: A Bidirectional Circular Neural Network
2018
Future Internet
on attention and semantic features. ...
With the semantic feature, we can obtain some more information about a word from the sentence. We evaluate different methods on the CEC Corpus, and this method is found to improve performance. ...
Acknowledgments: We would like to thank our colleagues for their suggestions and help.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/fi10100095
fatcat:c5bcchfnjnglbapfzhcc5zjqbq
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
[article]
2020
arXiv
pre-print
We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation. ...
Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data. ...
[390] synthesized images similar to, yet different from, the ground truth and then studied how and why the answers change under differing visual distortions. ...
arXiv:2010.09522v2
fatcat:l4npstkoqndhzn6hznr7eeys4u
Multimodal feature fusion based on object relation for video captioning
2022
CAAI Transactions on Intelligence Technology
The multimodal feature fusion network is used to fuse the features of different modalities. ...
However, most of the existing methods in the video captioning task ignore the relationship between objects in the video and the correlation between multimodal features, and they also ignore the effect ...
ACKNOWLEDGMENTS This work was supported by the National Natural Science Foundation of China under Grant 62077015 and the Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province ...
doi:10.1049/cit2.12071
fatcat:bmyvu6sr6zbqtac6jxfqiygutu
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
[article]
2020
arXiv
pre-print
Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. ...
X-LXMERT's image generation capabilities rival state-of-the-art generative models while its question answering and captioning abilities remain comparable to LXMERT. ...
For R-precision-easy, we sample 99 negative captions for each caption, where all negative captions correspond to different val2014 images. ...
arXiv:2009.11278v1
fatcat:td3n3swsajgujkrtrvkwdzykhq
Showing results 1 — 15 out of 1,177 results