A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community  ...  Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories.  ...  [59] propose a structured multi-level interaction network (SMIN), which makes further modifications on the 2D temporal feature map as its proposal generation module.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment [article]

Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, Larry S. Davis
2019 arXiv   pre-print
This research strives for natural language moment retrieval in long, untrimmed video streams.  ...  In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network.  ...  Moreover, adding more fine-grained word-level interactions between video and language can further improve the performance. Iterative graph adjustment network.  ... 
arXiv:1812.00087v2 fatcat:cbxtybz4cnf3xbqiudlyj3rlm4

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, Larry S. Davis
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
This research strives for natural language moment retrieval in long, untrimmed video streams.  ...  In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network.  ...  Conclusion We have presented MAN, a Moment Alignment Network that unifies candidate moment encoding and temporal structural reasoning in a single-shot structure for natural language moment retrieval.  ... 
doi:10.1109/cvpr.2019.00134 dblp:conf/cvpr/ZhangDWWD19 fatcat:mbglkapzw5hr5aoxrz6msvll5m

Progressive Localization Networks for Language-based Moment Localization [article]

Qi Zheng, Jianfeng Dong, Xiaoye Qu, Xun Yang, Yabing Wang, Pan Zhou, Baolong Liu, Xun Wang
2022 arXiv   pre-print
This paper targets the task of language-based video moment localization.  ...  Extensive experiments on three public datasets demonstrate the effectiveness of our proposed PLN for language-based moment localization, especially for localizing short moments in long videos.  ...  [78] devise a multi-level interaction module to fuse video and text using hierarchical feature maps.  ... 
arXiv:2102.01282v2 fatcat:7i2dm6t2y5bzzjgoz6wmd67bja

Cascaded MPN: Cascaded Moment Proposal Network for Video Corpus Moment Retrieval

Sunjae Yoon, Dahyun Kim, Junyeong Kim, Chang D. Yoo
2022 IEEE Access  
INDEX TERMS Video corpus moment retrieval, cascaded moment proposal, multi-modal interaction, vision-language system.  ...  Video corpus moment retrieval aims to localize temporal moments corresponding to textual query in a large video corpus.  ...  performs moment retrieval via devising cascaded multi-modal feature interaction among anchor-free and anchor-based video semantics.  ... 
doi:10.1109/access.2022.3183106 fatcat:4yxtpdaspnfrxpwgbitksd4v4u

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions [article]

Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
2022 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), a.k.a., natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language  ...  As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment  ...  These solutions design various cross-modal reasoning strategies to perform more fine-grained and deeper multi-modal interaction between video and query, for precise moment localization.  ... 
arXiv:2201.08071v1 fatcat:2k2if6dsyveinec2dmmujcmhkq

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [article]

Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan Zhou
2020 arXiv   pre-print
Finally, both video and query information is utilized to provide robust cross-modal representation for further moment localization.  ...  Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.  ...  CONCLUSIONS In this paper, we propose a novel Fine-grained Iterative Attention Network (FIAN) to adequately extract bilateral video-query interaction information for temporal language localization in videos  ... 
arXiv:2008.02448v1 fatcat:lohkhfoiufd2ngc2ak737ha75y

Language Guided Networks for Cross-modal Moment Retrieval [article]

Kun Liu, Huadong Ma, Chuang Gan
2020 arXiv   pre-print
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query.  ...  Specifically, the late guidance module is developed to linearly transform the output of localization networks via the channel attention mechanism.  ...  As shown in Figure 1, given a natural language query and an untrimmed video, the goal is to localize the start and end time of the moments described by the given sentence query.  ... 
arXiv:2006.10457v2 fatcat:eqnjjsvpvfc2darsrkayd3qvxm

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [article]

Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, Zichuan Xu
2020 arXiv   pre-print
Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query.  ...  Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relation inside each modality for frame (word) correlating.  ...  Query-based moment localization in videos.  ... 
arXiv:2008.01403v2 fatcat:ii2d4xab2vhfza54juiapdkw4i

A Survey on Natural Language Video Localization [article]

Xinfang Liu, Xiushan Nie, Zhifang Tan, Jie Guo, Yilong Yin
2021 arXiv   pre-print
Natural language video localization (NLVL), which aims to locate a target moment from a video that semantically corresponds to a text query, is a novel and challenging task.  ...  INTRODUCTION Given a video and a query sentence described in natural language form, natural language video localization (NLVL) aims at finding the segment from the video that is relevant to the query description  ...  Therefore, they proposed an Interaction-Integrated Network, where the network is able to capture long-range video structure information by overlaying Interaction-Integrated Cells, which is a module that  ... 
arXiv:2104.00234v1 fatcat:zuqg6fn6mjafbf3zwqyslmauhy

Dual-Channel Localization Networks for Moment Retrieval with Natural Language

Bolin Zhang, Bin Jiang, Chao Yang, Liang Pang
2022 Proceedings of the 2022 International Conference on Multimedia Retrieval  
According to the given natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video.  ...  the matching degree between natural language query and video moments.  ...  CONCLUSION AND FUTURE WORK This paper proposes an uncomplicated and efficient Dual-Channel Localization Network (DCLN) to localize the desired moment via a given query.  ... 
doi:10.1145/3512527.3531394 fatcat:o26re26zm5gj5evyypnr2vnvl4

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference [article]

Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang
2021 arXiv   pre-print
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.  ...  First, we propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions.  ...  7) XML: Cross-modal Moment Localization (XML) modular network [29] is a recently proposed transformer-based method for TV show retrieval. 8) HERO: a transformer-based framework [32] for video-and-language  ... 
arXiv:2107.12270v2 fatcat:jzfz6lwztrfpxpnocx4dx72eoq

Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language [article]

Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, Jiebo Luo
2021 arXiv   pre-print
Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization.  ...  We address the problem of retrieving a specific moment from an untrimmed video by natural language.  ...  The extracted feature encodes the language structure of the query sentence, thus describes the moment of interest.  ... 
arXiv:2012.02646v2 fatcat:yiq7hnoflbc3baaoj4jkg3kzim

Video Question Answering: Datasets, Algorithms and Challenges [article]

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, Tat-Seng Chua
2022 arXiv   pre-print
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.  ...  We then point out the research trend of studying beyond factoid QA to inference QA towards the cognition of video contents. Finally, we conclude some promising directions for future exploration.  ...  Video question answering (VideoQA) is one of the most prominent, given its promise to develop interactive AI to communicate with the dynamic visual world via natural language.  ... 
arXiv:2203.01225v1 fatcat:dn4sz5pomnfb7igvmxofangzsa

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [article]

Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu
2020 arXiv   pre-print
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.  ...  In this work we present an effective weakly-supervised model, named as Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage.  ...  TGA [21]: utilizes clip-level alignment by text guided attention. WSLLN [8]: a weakly-supervised language localization network. wMAN [23]: a weakly-supervised moment alignment network.  ... 
arXiv:2003.07048v1 fatcat:fcepouqkvves7op25a4vpwidmm