
Temporally Grounding Natural Sentence in Video

Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, Tat-Seng Chua
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences in long, untrimmed video sequences.  ...  Specifically, a novel Temporal GroundNet (TGN) is proposed to temporally capture the evolving fine-grained frame-by-word interactions between video and sentence.  ...  Grounding Natural Language in Video. Analogous to spatial grounding in images, this work studies a similar problem: temporal natural language grounding in video.  ...
doi:10.18653/v1/d18-1015 dblp:conf/emnlp/ChenCMJC18 fatcat:2fkaspfhhjcvlosu53o5amkfci
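
The TGN snippet above describes capturing frame-by-word interactions between the video and the sentence. Purely as an illustration (this is a generic dot-product attention sketch, not TGN's actual mechanism; the shapes and score function are assumptions), such an interaction can be computed as each frame attending over the words:

```python
# Illustrative frame-by-word interaction: each video frame attends over the
# words of the sentence. Generic attention sketch, not the paper's TGN model.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_by_word_attention(frames, words):
    """frames: (T, d) frame features; words: (N, d) word features.
    Returns the (T, N) attention map and a per-frame sentence context."""
    scores = frames @ words.T          # similarity of every frame to every word
    attn = softmax(scores, axis=1)     # each frame distributes attention over words
    context = attn @ words             # (T, d) word-weighted context per frame
    return attn, context

attn, ctx = frame_by_word_attention(np.random.randn(8, 16), np.random.randn(5, 16))
```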

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Zhenfang Chen, Lin Ma, Wenhan Luo, Kwan-Yee Kenneth Wong
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video.  ...  Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations  ...  Conclusion. In this paper, we introduced a new task, namely weakly-supervised spatio-temporally grounding natural sentence in video.  ...
doi:10.18653/v1/p19-1183 dblp:conf/acl/ChenMLW19 fatcat:6wf2b5kzgzgrtf5pvxcwqua3dy

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos [article]

Sangmin Woo, Jinyoung Park, Inyong Koo, Sumin Lee, Minki Jeong, Changick Kim
2022 arXiv   pre-print
Natural Language Video Grounding (NLVG) aims to localize time segments in an untrimmed video according to sentence queries.  ...  search space to find time segments directly, and the latter matches the predefined time segments with ground truths.  ...  The transformer is unable to preserve the order of temporally arranged video features due to the permutation-invariant nature of the architecture.  ... 
arXiv:2201.10168v4 fatcat:xszilg3hcrfjjcetlahur5tvue
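
The snippet above notes that a transformer is permutation-invariant, so the temporal order of video features must be injected explicitly. The standard remedy is a positional encoding; the sketch below uses the common sinusoidal scheme as an assumed illustration (the paper's exact scheme may differ):

```python
# Sinusoidal positional encodings restore temporal order for a
# permutation-invariant transformer. Shapes and dimensions are assumptions.
import numpy as np

def sinusoidal_encoding(num_steps, dim):
    """Returns a (num_steps, dim) positional encoding table."""
    pos = np.arange(num_steps)[:, None]          # time index of each clip feature
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

video_feats = np.random.randn(100, 256)          # 100 clip features, 256-d
video_feats = video_feats + sinusoidal_encoding(100, 256)  # inject order
```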

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos [article]

Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu
2019 arXiv   pre-print
Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence.  ...  contents for temporal sentence grounding.  ...  Related Works. Temporal sentence grounding in videos is a new task introduced recently [10, 14].  ...
arXiv:1910.14303v1 fatcat:mwmqqmbjhbevfhi4gjx3fsckvm

Localizing Moments in Video with Temporal Language [article]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2018 arXiv   pre-print
Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding.  ...  consists of temporal sentences annotated by humans (TEMPO - Human Language).  ...  In addition, unlike the MarioQA dataset, which consists of synthetic data constructed from gameplay videos, our dataset consists of real visual inputs, and includes temporal grounding of natural language  ...
arXiv:1809.01337v1 fatcat:zybfwpgby5bmtpuvjajtoxdoz4

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [article]

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao
2020 arXiv   pre-print
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).  ...  STVG has two challenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) We deal with multi-form  ...  Related Work: Temporal Localization via Natural Language. Temporal natural language localization aims to detect the video clip depicting the given sentence.  ...
arXiv:2001.06891v3 fatcat:df3uigkdrzbxdnfcze3ydpicpi
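
Both records of this paper (the arXiv preprint here and the CVPR version below) define the output as a spatio-temporal object tube. Purely as an illustration, and not the authors' data format, such a tube can be represented as a temporal segment plus one bounding box per covered frame; the dataclass and field names below are hypothetical:

```python
# A spatio-temporal tube: a temporal segment with one box per covered frame.
# Hypothetical representation for illustration; not the paper's own format.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class SpatioTemporalTube:
    start_frame: int        # the object may appear in only a small segment
    end_frame: int          # inclusive
    boxes: List[Box]        # one box per frame in [start_frame, end_frame]

    def __post_init__(self):
        assert len(self.boxes) == self.end_frame - self.start_frame + 1

tube = SpatioTemporalTube(10, 12, [(5, 5, 50, 80), (6, 5, 52, 81), (8, 6, 54, 82)])
```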

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).  ...  STVG has two challenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) We deal with multi-form  ...  Related Work: Temporal Localization via Natural Language. Temporal natural language localization aims to detect the video clip depicting the given sentence.  ...
doi:10.1109/cvpr42600.2020.01068 dblp:conf/cvpr/ZhangZZWLG20 fatcat:umcnf6qsajcezfx6k2sa2a6t5e

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [article]

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
2020 arXiv   pre-print
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video.  ...  Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal  ...  Introduction. Given a natural sentence and an untrimmed video, temporal video grounding [Gao et al., 2017; Hendricks et al., 2017] aims to determine the start and end timestamps of one segment in the  ...
arXiv:2001.09308v1 fatcat:z2h6kx2j4rg2xbfq4sopn6dmy4
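
The introduction snippet above defines temporal grounding as predicting the start and end timestamps of one segment. Predictions in this line of work are conventionally scored with temporal IoU against the annotated segment; the sketch below shows that standard metric (common practice in the area, not something specific to this paper):

```python
# Temporal IoU between a predicted and a ground-truth segment, in seconds.
def temporal_iou(pred, gt):
    """pred, gt: (start_sec, end_sec) tuples with start <= end."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((12.0, 25.0), (10.0, 20.0)))  # 8 / 15 = 0.533...
```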

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding [article]

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
2020 arXiv   pre-print
Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences.  ...  Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.  ...  Acknowledgments. This work was supported by the National Key R&D Program of China under Grant No. 2018AAA0100603, Zhejiang Natural Science Foundation LR19F020006 and the National Natural Science Foundation  ...
arXiv:2008.06941v2 fatcat:kn72wjmwivfhbpiwmnnqkd4i64

Localizing Moments in Video with Temporal Language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding.  ...  consists of temporal sentences annotated by humans (TEMPO - Human Language).  ...  In addition, unlike the MarioQA dataset, which consists of synthetic data constructed from gameplay videos, our dataset consists of real visual inputs, and includes temporal grounding of natural language  ...
doi:10.18653/v1/d18-1168 dblp:conf/emnlp/HendricksWSSDR18 fatcat:3tmvsxjozzd7xbuy4nhwaauzru

Zero-shot Natural Language Video Localization [article]

Jinwoo Nam, Daechul Ahn, Dongyeop Kang, Seong Jong Ha, Jonghyun Choi
2021 arXiv   pre-print
To eliminate the annotation costs, we make a first attempt to train a natural language video localization model in a zero-shot manner.  ...  Understanding videos to localize moments with natural language often requires large, expensive annotated video regions paired with language queries.  ...  The task aims to localize a temporal moment in a video by a natural language query.  ...
arXiv:2110.00428v1 fatcat:ca7uj2vn7za6bixhqhgh2x3ax4

Human-centric Spatio-Temporal Video Grounding With Visual Transformers [article]

Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, Dong Xu
2021 arXiv   pre-print
In this work, we introduce a novel task, Human-centric Spatio-Temporal Video Grounding (HC-STVG).  ...  for video-sentence matching and temporal localization.  ...  Temporal Video Grounding. The goal of temporal video grounding is to localize the most relevant video segment given a query sentence.  ...
arXiv:2011.05049v2 fatcat:lfgpc7gsxvbbzdwqhhv3qgv4b4

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences.  ...  Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.  ...  Spatio-temporal video grounding is a natural extension of temporal grounding, which retrieves a spatio-temporal tube from a video corresponding to the sentence.  ... 
doi:10.24963/ijcai.2020/149 dblp:conf/ijcai/ZhangZLHY20 fatcat:4yux7bufpzeqpeexjwfux4tubq

Learning Sample Importance for Cross-Scenario Video Temporal Grounding [article]

Peijun Bao, Yadong Mu
2022 arXiv   pre-print
The task of temporal grounding aims to locate a video moment in an untrimmed video given a sentence query.  ...  We evaluate the proposed model in cross-scenario temporal grounding, where the train/test data are heterogeneously sourced.  ...  There are 3,720 moment-sentence pairs in the testing set. DiDeMo was recently proposed in [Hendricks et al., 2018], specifically for natural language moment retrieval in open-world videos.  ...
arXiv:2201.02848v1 fatcat:mge6vtradfaxdm2yds6rfhzloq

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions [article]

Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
2022 arXiv   pre-print
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.  ...  MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form  ...  Each recognized word is associated with a temporal timestamp. At this step in our pipeline, we have sentences temporally grounded to the original video stream.  ... 
arXiv:2112.00431v2 fatcat:gmpn22jdsfb55laxiltelf35wq
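
The MAD snippet describes a pipeline in which every recognized word carries a timestamp, after which whole sentences are grounded to the video stream. Below is a minimal sketch of that step, under the simplifying assumption that a sentence's span runs from its first to its last word timestamp; the paper's actual alignment may be more involved:

```python
# Derive a sentence-level temporal span from per-word ASR timestamps.
# Assumes the span is bounded by the first and last word; illustrative only.
def sentence_span(word_times):
    """word_times: ordered list of (word, time_sec) for one sentence."""
    times = [t for _, t in word_times]
    return min(times), max(times)

words = [("he", 12.4), ("opens", 12.7), ("the", 12.9), ("door", 13.2)]
print(sentence_span(words))  # (12.4, 13.2)
```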
Showing results 1–15 of 21,572.