1,177 Hits in 5.5 sec

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions.  ...  We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.  ...  The attended visual feature is generated by a weighted sum: $v_t = \sum_{i=1}^{p} \alpha^t_i \cdot v_{i+m-1}$ (Eq. 9). We expect the model to better locate "key frames" and produce more semantically correlated words by involving  ...  (a minimal numeric sketch of this weighted sum follows this entry)
doi:10.1109/cvpr.2018.00751 dblp:conf/cvpr/WangJ00X18 fatcat:l3mc7jrzhna4djoe6okz4x7vdy
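The snippet above quotes the paper's Eq. 9, an attention-weighted sum over a window of frame features. Below is a minimal NumPy sketch of that computation, assuming a 1-based window start index m; the function name and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def attended_visual_feature(alpha_t, V, m):
    """Sketch of Eq. 9: v_t = sum_{i=1..p} alpha_i^t * v_{i+m-1}.

    alpha_t : (p,) attention weights at decoding step t (assumed to sum to 1)
    V       : (n, d) per-frame visual features
    m       : 1-based index of the first frame covered by the attention window
    """
    p = alpha_t.shape[0]
    window = V[m - 1 : m - 1 + p]                   # the p features v_m ... v_{m+p-1}
    return (alpha_t[:, None] * window).sum(axis=0)  # weighted sum, shape (d,)

# Toy usage: 10 frames of 4-d features, a 3-frame window starting at frame 2.
V = np.random.randn(10, 4)
alpha = np.array([0.2, 0.5, 0.3])
v_t = attended_visual_feature(alpha, V, m=2)
```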

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning [article]

Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu
2018 arXiv   pre-print
Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions.  ...  We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.  ...  Captioning performance on different activity categories: in Fig. 8, we provide detailed dense captioning performance for videos from different activity categories.  ...
arXiv:1804.00100v2 fatcat:snnkr2e2fzcqxjghqaafqnsdzq

Semi Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework [article]

Deepan Das, Noor Mohammed Ghouse, Shashank Verma, Yin Li
2019 arXiv   pre-print
The model's retrieval and localization performance is evaluated on the MSCOCO and Flickr30K Entities datasets.  ...  To accomplish this task, our architecture makes use of the rich semantic information available in a joint embedding space of multi-modal data.  ...  We hypothesize that by using such a proxy task for our neural network framework, we can implicitly capture visual associations between caption tokens and objects located at different spatial locations in the  ...
arXiv:1908.02950v1 fatcat:areuscjwnba4nf26lhztfsv2im

Natural Language Description of Videos for Smart Surveillance

Aniqa Dilawari, Muhammad Usman Ghani Khan, Yasser D. Al-Otaibi, Zahoor-ur Rehman, Atta-ur Rahman, Yunyoung Nam
2021 Applied Sciences  
After the September 11 attacks, security and surveillance measures have changed across the globe. Now, surveillance cameras are installed almost everywhere, and their video footage must be monitored.  ...  This framework is based on multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks  ...  CCTV (closed-circuit television) security cameras were installed in these locations. Each location consisted of 300 videos. The span of each video lay between 7 and 10 s.  ...
doi:10.3390/app11093730 fatcat:62o2odoqn5gajk46grynsxaa74

UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations [article]

Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma
2019 arXiv   pre-print
We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts.  ...  The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes.  ...  Ablation study: semantic components. We now delve into the effectiveness of different semantic components by choosing different combinations of components for the caption embedding.  ... 
arXiv:1904.05521v2 fatcat:kfsldaebbvawbf7xwd7yuy7izu

Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding [article]

Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi, Jianxin Liao
2022 arXiv   pre-print
These methods do not reason about target moment locations based on visual-textual semantic alignment but over-rely on the temporal biases of queries in training sets.  ...  Extensive experiments on Charades-STA and ActivityNet Captions demonstrate the effectiveness of our method for mitigating the reliance on temporal biases and strengthening the model's generalization ability  ...  The cross-modal matching task requires that the model predict frame-level cross-modal relevance as consistently as possible for target moments, even if their temporal positions change (see the hedged sketch after this entry).  ...
arXiv:2207.14698v1 fatcat:2srhrtljovby7ijznxgbvjpvwu
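The last snippet above describes a cross-modal matching constraint: frame-level relevance predictions for target-moment frames should stay consistent when the video's frames are shuffled. Below is a hedged sketch of one way such a constraint could be scored, assuming a mean-squared consistency penalty; the loss choice, function name, and toy data are assumptions, not the paper's exact formulation.

```python
import numpy as np

def relevance_consistency_loss(rel_orig, rel_shuf, perm, target_mask):
    """Hedged sketch: penalize changes in frame-level cross-modal relevance
    for target-moment frames after the video frames are shuffled.

    rel_orig    : (T,) relevance of each original frame to the query
    rel_shuf    : (T,) relevance of each frame of the shuffled video
    perm        : (T,) permutation used to shuffle, shuffled[j] = original[perm[j]]
    target_mask : (T,) boolean mask over *original* frame indices marking the target moment
    """
    # Re-align shuffled relevance scores back to the original frame order.
    realigned = np.empty_like(rel_shuf)
    realigned[perm] = rel_shuf
    diff = (rel_orig - realigned)[target_mask]
    # Assumed mean-squared consistency penalty over target-moment frames.
    return float(np.mean(diff ** 2))

# Toy usage with 6 frames, frames 2-4 forming the target moment.
T = 6
perm = np.random.permutation(T)
rel_orig = np.random.rand(T)
rel_shuf = rel_orig[perm] + 0.01 * np.random.randn(T)  # nearly-consistent predictions
mask = np.zeros(T, dtype=bool); mask[2:5] = True
loss = relevance_consistency_loss(rel_orig, rel_shuf, perm, mask)
```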

Text2Scene: Generating Compositional Scenes From Textual Descriptions

Fuwen Tan, Song Feng, Vicente Ordonez
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc.) at every time step by attending to different parts of the input text and the current status  ...  images, and synthetic images.  ...  Acknowledgements: This work was partially supported by an IBM Faculty Award to V.O., and gift funding from SAP Research.  ...
doi:10.1109/cvpr.2019.00687 dblp:conf/cvpr/TanFO19 fatcat:yzi744fv55hnrgmvvtybyu7mbe

Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Shanshan Qi, Luxi Yang, Chunguo Li, Yongming Huang
2021 IEEE Access  
Moreover, this task needs to detect location clues precisely in both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence is still  ...  Firstly, we present a coarse-grained crucial frame selection module, where query-guided local difference context modeling from adjacent frames helps discriminate all the coarse boundary locations relevant  ...  The difference operation helps recognize sentence-relevant interactions and scene changes (a hedged illustration follows this entry).  ...
doi:10.1109/access.2021.3095229 fatcat:jow3mzxavfaohemaxgk4d2buey
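The snippet above mentions local difference context modeling from adjacent frames as a cue for scene changes. Below is a hedged, simplified illustration of adjacent-frame feature differencing (without the paper's query guidance); the function name and toy data are illustrative only.

```python
import numpy as np

def local_difference_context(frame_feats):
    """Hedged illustration (not the paper's exact module): adjacent-frame
    feature differences as a cheap cue for scene changes or interactions.

    frame_feats : (T, d) per-frame features
    returns     : (T, d) difference context, with the first frame zero-padded
    """
    diff = np.diff(frame_feats, axis=0)  # (T-1, d) adjacent-frame differences
    return np.vstack([np.zeros((1, frame_feats.shape[1])), diff])

# Frames with the largest difference magnitude are coarse boundary candidates.
feats = np.random.randn(8, 16)
ctx = local_difference_context(feats)
boundary_scores = np.linalg.norm(ctx, axis=1)
```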

Generating Images from Captions with Attention [article]

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov
2016 arXiv   pre-print
We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.  ...  After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks.  ...  ., 2015) for open sourcing their code, and Ryan Kiros and Nitish Srivastava for helpful discussions.  ... 
arXiv:1511.02793v2 fatcat:vrpchw52wbcc5mv34iqwqg7i7m

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case [article]

Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes
2021 arXiv   pre-print
Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed  ...  To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe  ...  Acknowledgements This work has been funded by Eureka and the Swedish Governmental Agency for Innovation Systems through the Eurostars program, grant agreement no. 11776.  ... 
arXiv:2102.11115v1 fatcat:f6l5gzk7hjdgjayvzgnw7ux4ha

Text2Scene: Generating Compositional Scenes from Textual Descriptions [article]

Fuwen Tan, Song Feng, Vicente Ordonez
2019 arXiv   pre-print
Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc.) at every time step by attending to different parts of the input text and the current status  ...  images, and synthetic images.  ...  Model: Text2Scene adopts a sequence-to-sequence approach [31] and introduces key designs for spatial and sequential reasoning.  ...
arXiv:1809.01110v3 fatcat:xhpqjfczx5htdg5sfbck2cw3em

Chinese Event Extraction Based on Attention and Semantic Features: A Bidirectional Circular Neural Network

Yue Wu, Junyi Zhang
2018 Future Internet  
on attention and semantic features.  ...  With the semantic feature, we can obtain more information about a word from the sentence. We evaluate different methods on the CEC corpus, and this method is found to improve performance.  ...  Acknowledgments: We would like to thank our colleagues for their suggestions and help. Conflicts of Interest: The authors declare no conflict of interest.  ...
doi:10.3390/fi10100095 fatcat:c5bcchfnjnglbapfzhcc5zjqbq

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation.  ...  Deep Learning and its applications have driven impactful research and development across the diverse range of modalities present in real-world data.  ...  [390] synthesized images similar to yet different from the ground truth and then studied how and why the answers change with differing visual distortions.  ...
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

Multimodal feature fusion based on object relation for video captioning

Zhiwen Yan, Ying Chen, Jinlong Song, Jia Zhu
2022 CAAI Transactions on Intelligence Technology  
The multimodal feature fusion network is used to fuse the features of different modalities.  ...  However, most existing methods for the video captioning task ignore the relationships between objects in the video and the correlations between multimodal features, and they also ignore the effect  ...  ACKNOWLEDGMENTS: This work was supported by the National Natural Science Foundation of China under Grant 62077015 and the Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province  ...
doi:10.1049/cit2.12071 fatcat:bmyvu6sr6zbqtac6jxfqiygutu

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [article]

Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
2020 arXiv   pre-print
Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup.  ...  X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.  ...  For R-precision-easy, we sample 99 negative captions for each caption, where all negative captions correspond to different val2014 images (a hedged sketch of this protocol follows this entry).  ...
arXiv:2009.11278v1 fatcat:td3n3swsajgujkrtrvkwdzykhq
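The last snippet above describes the R-precision-easy protocol: each true caption is ranked against 99 negative captions drawn from other images. Below is a hedged sketch of that evaluation, assuming cosine similarity over L2-normalized embeddings; the function name and toy data are illustrative and not the paper's exact evaluation code.

```python
import numpy as np

def r_precision_easy(image_embs, caption_embs, n_neg=99, seed=0):
    """Hedged sketch: for each image, score its true caption against n_neg
    captions sampled from other images and count how often it ranks first.

    image_embs   : (N, d) L2-normalized image embeddings
    caption_embs : (N, d) L2-normalized embeddings of the matching captions
    """
    rng = np.random.default_rng(seed)
    N = image_embs.shape[0]
    hits = 0
    for i in range(N):
        # Sample negatives from captions belonging to different images.
        negatives = rng.choice([j for j in range(N) if j != i], size=n_neg, replace=False)
        candidates = np.concatenate(([i], negatives))
        scores = caption_embs[candidates] @ image_embs[i]  # cosine similarity
        hits += int(np.argmax(scores) == 0)                # true caption is candidate 0
    return hits / N

# Toy usage: 200 random unit-norm embeddings (a real setup would use model outputs).
d, N = 64, 200
img = np.random.randn(N, d); img /= np.linalg.norm(img, axis=1, keepdims=True)
cap = img + 0.1 * np.random.randn(N, d); cap /= np.linalg.norm(cap, axis=1, keepdims=True)
print(r_precision_easy(img, cap))
```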
Showing results 1 — 15 out of 1,177