4,413 Hits in 5.1 sec

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos [article]

Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu
2019 arXiv   pre-print
In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating  ...  Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence.  ...  In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which leverages sentence semantic information to modulate the temporal convolution processes in a hierarchical  ... 
arXiv:1910.14303v1 fatcat:mwmqqmbjhbevfhi4gjx3fsckvm
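The modulation idea in this abstract, conditioning temporal convolutions on sentence semantics, can be pictured as FiLM-style feature-wise conditioning. Below is a minimal PyTorch sketch of that general pattern, not the authors' exact SCDM layer; all names (`SentenceConditionedModulation`, `sent_emb`) are illustrative.

```python
import torch
import torch.nn as nn

class SentenceConditionedModulation(nn.Module):
    """FiLM-style sketch: a sentence embedding produces per-channel scale and
    shift parameters that modulate temporal convolution feature maps."""
    def __init__(self, sent_dim, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.to_gamma = nn.Linear(sent_dim, channels)  # per-channel scale
        self.to_beta = nn.Linear(sent_dim, channels)   # per-channel shift

    def forward(self, video_feats, sent_emb):
        # video_feats: (batch, channels, time); sent_emb: (batch, sent_dim)
        x = self.conv(video_feats)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.to_beta(sent_emb).unsqueeze(-1)
        return torch.relu(gamma * x + beta)

# usage
layer = SentenceConditionedModulation(sent_dim=256, channels=128)
out = layer(torch.randn(2, 128, 64), torch.randn(2, 256))  # (2, 128, 64)
```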

Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation [article]

Mrigank Rochan, Mahesh Kumar Krishna Reddy, Yang Wang
2020 arXiv   pre-print
In this paper, we propose a sentence guided temporal modulation (SGTM) mechanism that utilizes the sentence embedding to modulate the normalized temporal activations of the video thumbnail generation network  ...  We consider the problem of sentence specified dynamic video thumbnail generation.  ...  We thank NVIDIA for donating some of the GPUs used in this work.  ... 
arXiv:2008.13362v1 fatcat:b7xp3nuhwrd7dldbtbpd54rgki
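Reading "modulate the normalized temporal activations" as conditional normalization, a minimal sketch might look as follows; the affine-free instance normalization and the layer name are assumptions, not the paper's exact SGTM module.

```python
import torch
import torch.nn as nn

class SentenceGuidedTemporalModulation(nn.Module):
    """Sketch: the sentence embedding supplies the affine parameters of a
    temporal normalization layer (conditional-normalization style)."""
    def __init__(self, sent_dim, channels):
        super().__init__()
        # normalize each channel over time, with no learnable affine of its own
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(sent_dim, 2 * channels)  # -> [gamma, beta]

    def forward(self, feats, sent_emb):
        # feats: (batch, channels, time); sent_emb: (batch, sent_dim)
        gamma, beta = self.affine(sent_emb).chunk(2, dim=-1)
        normed = self.norm(feats)
        return (1 + gamma.unsqueeze(-1)) * normed + beta.unsqueeze(-1)

mod = SentenceGuidedTemporalModulation(sent_dim=300, channels=128)
y = mod(torch.randn(4, 128, 32), torch.randn(4, 300))  # (4, 128, 32)
```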

Sentence Specified Dynamic Video Thumbnail Generation

Yitian Yuan, Lin Ma, Wenwu Zhu
2019 Proceedings of the 27th ACM International Conference on Multimedia - MM '19  
information, based on which a temporal conditioned pointer network is then introduced to sequentially generate the sentence specified video thumbnails.  ...  In this paper, we define a distinctively new task, namely sentence specified dynamic video thumbnail generation, where the generated thumbnails not only provide a concise preview of the original video  ...  In order to generate corresponding ground truth for temporal sentence localization in our created dataset, for each sentence query, we merge each group of continuous annotated video clips into a video  ... 
doi:10.1145/3343031.3350985 dblp:conf/mm/YuanM019 fatcat:62i2nctjmjc7ja7metv4pa7k2u
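The "temporal conditioned pointer network" that sequentially emits thumbnail clips suggests a pointer-attention decoder. A hypothetical sketch of one such decoder follows; the names, the greedy argmax selection, and the mean-pooled start token are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalPointerDecoder(nn.Module):
    """Sketch: a GRU decoder that sequentially 'points' at video clips.
    At each step, attention scores over clip features pick the next
    thumbnail clip, conditioned on the query and prior selections."""
    def __init__(self, clip_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(clip_dim, hidden_dim)
        self.w_clip = nn.Linear(clip_dim, hidden_dim, bias=False)
        self.w_state = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, clips, query_state, steps):
        # clips: (batch, n_clips, clip_dim); query_state: (batch, hidden_dim)
        h, picked = query_state, []
        inp = clips.mean(dim=1)  # start token: mean clip feature
        for _ in range(steps):
            h = self.gru(inp, h)
            scores = self.v(torch.tanh(self.w_clip(clips)
                                       + self.w_state(h).unsqueeze(1))).squeeze(-1)
            idx = scores.argmax(dim=-1)                    # pointed clip index
            picked.append(idx)
            inp = clips[torch.arange(clips.size(0)), idx]  # feed selection back
        return torch.stack(picked, dim=1)  # (batch, steps)

dec = TemporalPointerDecoder(clip_dim=128, hidden_dim=256)
idx = dec(torch.randn(2, 20, 128), torch.randn(2, 256), steps=4)  # (2, 4)
```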

Discriminative Latent Semantic Graph for Video Captioning [article]

Yang Bai, Junyan Wang, Yang Long, Bingzhang Hu, Yang Song, Maurice Pagnucco, Yu Guan
2021 arXiv   pre-print
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal  ...  information into latent object proposal. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: A novel Discriminative  ...  Our dynamic graph in the Latent Proposal Aggregation module is able to extract high-level latent semantic concepts without an external dataset for training.  ... 
arXiv:2108.03662v1 fatcat:6okzuqntjngcvl7cndxybjnjje
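The "Latent Proposal Aggregation" step, dynamically extracting higher-level visual words from object proposals, resembles attention pooling with learned latent queries. A minimal sketch under that assumption; the module name and the latent-query design are illustrative, not the paper's stated mechanism.

```python
import torch
import torch.nn as nn

class LatentProposalAggregation(nn.Module):
    """Sketch: K learned latent 'visual word' queries attend over object
    proposal features and pool them into higher-level concept vectors."""
    def __init__(self, proposal_dim, n_words=8):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(n_words, proposal_dim))

    def forward(self, proposals):
        # proposals: (batch, n_proposals, proposal_dim)
        attn = torch.softmax(self.latent @ proposals.transpose(1, 2), dim=-1)
        return attn @ proposals  # (batch, n_words, proposal_dim)

agg = LatentProposalAggregation(proposal_dim=256, n_words=8)
words = agg(torch.randn(2, 36, 256))  # (2, 8, 256)
```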

A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community  ...  To the best of our knowledge, this is the first systematic survey on temporal sentence grounding.  ...  video contents in a temporal convolution procedure, dynamically modulating the temporal feature maps concerning the sentence.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu

Move Forward and Tell: A Progressive Generator of Video Descriptions [article]

Yilei Xiong, Bo Dai, Dahua Lin
2018 arXiv   pre-print
Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner.  ...  They typically treat an entire video as a whole and generate the caption conditioned on a single embedding.  ...  As mentioned, the network takes in the ground-truth events of a video sequentially, producing one sentence for each event (conditioned on the previous state).  ... 
arXiv:1807.10018v1 fatcat:xhay72cq4fhzjntpsgoveomvfy
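The progressive scheme, consuming ground-truth events sequentially and producing one sentence per event conditioned on the previous state, could be skeletonized as below. The single GRU cell and the linear word head stand in for the paper's full clip selector and sentence decoder; both are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ProgressiveCaptioner(nn.Module):
    """Sketch of the progressive scheme in the abstract: walk through a
    sequence of event clips and emit one output per event, with each step
    conditioned on the state left behind by the previous one."""
    def __init__(self, clip_dim, hidden_dim, vocab_size):
        super().__init__()
        self.state_rnn = nn.GRUCell(clip_dim, hidden_dim)  # carries coherence
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, event_clips):
        # event_clips: (batch, n_events, clip_dim), in ground-truth order
        h = event_clips.new_zeros(event_clips.size(0), self.word_head.in_features)
        logits = []
        for t in range(event_clips.size(1)):
            h = self.state_rnn(event_clips[:, t], h)  # condition on prior state
            logits.append(self.word_head(h))          # stand-in for a decoder
        return torch.stack(logits, dim=1)  # (batch, n_events, vocab_size)

model = ProgressiveCaptioner(clip_dim=128, hidden_dim=256, vocab_size=1000)
out = model(torch.randn(2, 5, 128))  # (2, 5, 1000)
```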

Move Forward and Tell: A Progressive Generator of Video Descriptions [chapter]

Yilei Xiong, Bo Dai, Dahua Lin
2018 Lecture Notes in Computer Science  
Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner.  ...  They typically treat an entire video as a whole and generate the caption conditioned on a single embedding.  ...  As mentioned, the network takes in the ground-truth events of a video sequentially, producing one sentence for each event (conditioned on the previous state).  ... 
doi:10.1007/978-3-030-01252-6_29 fatcat:twbxfif36rgtrcxa2m3wwbo2uu

Visual-aware Attention Dual-stream Decoder for Video Captioning [article]

Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, Luo Zhong
2021 arXiv   pre-print
Video captioning is a challenging task that captures different visual parts and describes them in sentences, as it requires visual and linguistic coherence.  ...  This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames. To generate semantically coherent sentences, we propose a new Visual-aware  ...  INTRODUCTION Video Captioning is the task of generating a meaningful natural sentence for a given video.  ... 
arXiv:2110.08578v1 fatcat:gsa6o75oqrgo3b3c2gxdzgt5ti

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [article]

Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu
2020 arXiv   pre-print
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.  ...  Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.  ...  Based on the SSD framework [20], [29] proposed a semantic conditioned dynamic modulation (SCDM) mechanism to correlate and compose the sentence-specific video content over time.  ... 
arXiv:2003.07048v1 fatcat:fcepouqkvves7op25a4vpwidmm
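The reconstruction-based weak supervision described here, scoring a segment by how well its attended features can rebuild the query, can be sketched as follows. The module names and the single GRU reconstructor are illustrative simplifications of the paper's multi-level design.

```python
import torch
import torch.nn as nn

class AttentionalReconstruction(nn.Module):
    """Sketch of the weakly-supervised idea: attend over frame features with
    the query, then score how well the attended video content reconstructs
    the query words; no temporal annotations are needed."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.reconstruct = nn.GRU(dim, dim, batch_first=True)
        self.word_head = nn.Linear(dim, vocab_size)

    def forward(self, frames, query_emb, word_ids):
        # frames: (B, T, dim); query_emb: (B, dim); word_ids: (B, L) long
        attn = torch.softmax(self.score(frames * query_emb.unsqueeze(1)), dim=1)
        pooled = (attn * frames).sum(dim=1)               # (B, dim)
        dec_in = pooled.unsqueeze(1).expand(-1, word_ids.size(1), -1)
        out, _ = self.reconstruct(dec_in)
        logits = self.word_head(out)                      # (B, L, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), word_ids.reshape(-1))

model = AttentionalReconstruction(dim=256, vocab_size=1000)
loss = model(torch.randn(2, 30, 256), torch.randn(2, 256),
             torch.randint(0, 1000, (2, 8)))
```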

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding [article]

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
2020 arXiv   pre-print
Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences.  ...  Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.  ...  [Yuan et al., 2019] propose a semantic conditioned dynamic modulation for better correlating video contents over time and [Zhang et al., 2019b] adopt a 2D temporal map to cover diverse moments with  ... 
arXiv:2008.06941v2 fatcat:kn72wjmwivfhbpiwmnnqkd4i64
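The 2D temporal map mentioned in the snippet (attributed to [Zhang et al., 2019b]) enumerates all (start, end) candidate moments in one grid. A minimal sketch of how such a map can be built by average-pooling clip features; the function name and cumulative-sum trick are illustrative, not the cited paper's exact code.

```python
import torch

def build_2d_temporal_map(clip_feats):
    """Sketch: moment (i, j) is the average of clip features i..j, giving a
    2D map that covers moments of all lengths in a single tensor."""
    # clip_feats: (n_clips, dim)
    n, d = clip_feats.shape
    cumsum = torch.cat([clip_feats.new_zeros(1, d),
                        clip_feats.cumsum(dim=0)], dim=0)  # (n + 1, dim)
    fmap = clip_feats.new_zeros(n, n, d)
    for i in range(n):
        for j in range(i, n):
            fmap[i, j] = (cumsum[j + 1] - cumsum[i]) / (j - i + 1)
    return fmap  # indexed (start, end); lower triangle stays zero (invalid)

fmap = build_2d_temporal_map(torch.randn(16, 128))  # (16, 16, 128)
```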

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences.  ...  Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.  ...  ., 2019] propose a semantic conditioned dynamic modulation for better correlating video contents over time and [Zhang et al., 2019b] adopt a 2D temporal map to cover diverse moments with different lengths  ... 
doi:10.24963/ijcai.2020/149 dblp:conf/ijcai/ZhangZLHY20 fatcat:4yux7bufpzeqpeexjwfux4tubq

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context.  ...  Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video.  ...  While aforementioned captioning methods generate only one sentence for the input video, video paragraph generation focuses on producing multiple semantics-fluent sentences.  ... 
doi:10.1109/cvpr.2018.00751 dblp:conf/cvpr/WangJ00X18 fatcat:l3mc7jrzhna4djoe6okz4x7vdy
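Context gating, in the general form the title refers to, is a learned sigmoid gate that blends an event's feature with its surrounding video context. A minimal sketch of that pattern follows; it is the generic gate, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    """Sketch of a context gate: a sigmoid gate learned from the event
    feature and its video context decides, per dimension, how much
    context to blend in before captioning."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, event_feat, context_feat):
        # both: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([event_feat, context_feat], dim=-1)))
        return g * event_feat + (1 - g) * context_feat

fuse = ContextGating(dim=512)
fused = fuse(torch.randn(2, 512), torch.randn(2, 512))  # (2, 512)
```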

LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval [article]

Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer
2020 arXiv   pre-print
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to the given natural language query without access to temporal annotations during training.  ...  Prior strongly- and weakly-supervised approaches often leverage co-attention mechanisms to learn visual-semantic representations for localization.  ...  Conditioned on the semantic and visual information from the visual-semantic representations, WCVG performs multiple iterations of message-passing where it dynamically weighs the relevance of other frames  ... 
arXiv:1909.13784v2 fatcat:btgosisk6bb4pklnwgpkojk53m
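The message-passing step described for WCVG, iteratively re-weighting frames by their relevance to query words, can be sketched as a co-attention update. The shared projection and the fixed three iterations are simplifying assumptions, not the paper's exact graph construction.

```python
import torch
import torch.nn as nn

class WordFrameMessagePassing(nn.Module):
    """Sketch of one co-attention message-passing round: every frame gathers
    messages from query words (and vice versa), weighted by learned
    relevance, to refine the visual-semantic representations."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, frames, words, n_iters=3):
        # frames: (B, T, dim); words: (B, L, dim)
        for _ in range(n_iters):
            rel = torch.softmax(self.proj(frames) @ words.transpose(1, 2), dim=-1)
            frames = frames + rel @ words    # frames absorb word messages
            rel_w = torch.softmax(words @ self.proj(frames).transpose(1, 2), dim=-1)
            words = words + rel_w @ frames   # words absorb frame messages
        return frames, words

mp = WordFrameMessagePassing(dim=256)
f, w = mp(torch.randn(2, 40, 256), torch.randn(2, 10, 256))
```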

Text-based Localization of Moments in a Video Corpus [article]

Sudipta Paul, Niluthpol Chowdhury Mithun, Amit K. Roy-Chowdhury
2021 arXiv   pre-print
Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video.  ...  In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.  ...  Semantic Conditioned Dynamic Modulation (SCDM) was proposed in [13] for correlating sentence and related video contents.  ... 
arXiv:2008.08716v2 fatcat:s3epp3qmijgsdirktv3idcu7n4
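HMAN's stated goals, separating subtle intra-video moment differences while also distinguishing inter-video semantics, suggest a two-level ranking objective. A hedged sketch is below; the margins, cosine similarity, and function name are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def corpus_moment_ranking_loss(query, pos_moment, intra_negs, inter_negs,
                               margin_intra=0.1, margin_inter=0.2):
    """Sketch of a two-level ranking objective: the matched moment must beat
    other moments of the same video (subtle intra-video differences) and
    moments from other videos (global inter-video semantics)."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    pos = sim(query, pos_moment)                            # (batch,)
    l_intra = F.relu(margin_intra - pos.unsqueeze(1)
                     + sim(query.unsqueeze(1), intra_negs)).mean()
    l_inter = F.relu(margin_inter - pos.unsqueeze(1)
                     + sim(query.unsqueeze(1), inter_negs)).mean()
    return l_intra + l_inter

loss = corpus_moment_ranking_loss(torch.randn(8, 256), torch.randn(8, 256),
                                  torch.randn(8, 5, 256), torch.randn(8, 20, 256))
```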

Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries

Hao Wang, Cheng Deng, Fan Ma, Yi Yang
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
To address this limitation, we construct a context modulated dynamic convolutional network. Specifically, we propose a context modulated dynamic convolution operation within this framework.  ...  Previous methods mainly leverage dynamic convolutional networks to match visual and semantic representations.  ...  In this way, we can integrate the temporal features into the context modulated convolution to segment the target object in videos.  ... 
doi:10.1609/aaai.v34i07.6895 fatcat:s7vffkona5h6pkbpq2cty5j2wu
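A context modulated dynamic convolution, in the general sense this abstract describes, generates a convolution kernel from the language query and adjusts it with a global visual-context vector before applying it to the feature map. A minimal single-sample sketch under those assumptions; the class name and sigmoid gating are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModulatedDynamicConv(nn.Module):
    """Sketch: a convolution whose kernel is generated from the language
    query and then modulated by a global visual-context vector before it
    is applied to the visual feature map."""
    def __init__(self, text_dim, channels, ksize=3):
        super().__init__()
        self.ksize = ksize
        self.kernel_gen = nn.Linear(text_dim, channels * ksize * ksize)
        self.ctx_gate = nn.Linear(channels, channels * ksize * ksize)

    def forward(self, feats, text_emb):
        # feats: (1, C, H, W), one sample; text_emb: (text_dim,)
        c = feats.size(1)
        kernel = self.kernel_gen(text_emb)                       # language filter
        ctx = torch.sigmoid(self.ctx_gate(feats.mean(dim=(2, 3)).squeeze(0)))
        kernel = (kernel * ctx).view(1, c, self.ksize, self.ksize)
        return F.conv2d(feats, kernel, padding=self.ksize // 2)  # (1, 1, H, W)

layer = ContextModulatedDynamicConv(text_dim=300, channels=64)
mask_logits = layer(torch.randn(1, 64, 32, 32), torch.randn(300))
```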
Showing results 1 — 15 out of 4,413 results