Localizing Moments in Video with Natural Language [article]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2017 arXiv   pre-print
We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description.  ...  Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when.  ...  Only comparing moments within a single video means the model must learn to differentiate between subtle differences without learning how to differentiate between broader semantic concepts (e.g., "girl"  ... 
arXiv:1708.01641v1 fatcat:sgrv3qlhhfaujh6szkoxgwgmqa

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [article]

Jie Lei, Tamara L. Berg, Mohit Bansal
2021 arXiv   pre-print
We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations  ...  Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant  ...  QVHIGHLIGHTS can have multiple disjoint moments paired with a single query (on average 1.8 moments per query in a video), while all the moment retrieval datasets can only have a single moment.  ... 
arXiv:2107.09609v2 fatcat:wroc3zg6ufbtzcird72vyq6dfa

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network [article]

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 arXiv   pre-print
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training.  ...  Video moment retrieval is to search the moment that is most relevant to the given natural language query.  ...  This motivates us to develop a weakly-supervised method for moment retrieval that needs only coarse video-level annotations for training.  ... 
arXiv:1911.08199v3 fatcat:7vwjsnr6cza7fj74rifxd22sdm

AssistSR: Task-oriented Question-driven Video Segment Retrieval [article]

Stan Weixian Lei, Yuxuan Wang, Dongxing Mao, Difei Gao, Mike Zheng Shou
2022 arXiv   pre-print
Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text.  ...  In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).  ...  only have one single moment or one single video.  ... 
arXiv:2111.15050v3 fatcat:mhqwl54piffp5j4v5wnnqktafq

Weakly Supervised Video Moment Retrieval From Text Queries [article]

Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
2019 arXiv   pre-print
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval.  ...  There have been a few recent methods proposed in text to video moment retrieval using natural language queries, but requiring full supervision during training.  ...  This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc.  ... 
arXiv:1904.03282v2 fatcat:5qithwolavfwpofawe232b6pzi

Weakly Supervised Video Moment Retrieval From Text Queries

Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval.  ...  There have been a few recent methods proposed in text to video moment retrieval using natural language queries, but requiring full supervision during training.  ...  This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc.  ... 
doi:10.1109/cvpr.2019.01186 dblp:conf/cvpr/MithunPR19 fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm

Uncovering Hidden Challenges in Query-Based Video Moment Retrieval [article]

Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä
2020 arXiv   pre-print
The query-based moment retrieval is a problem of localising a specific clip from an untrimmed video according to a query sentence.  ...  Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has significant  ...  We asked the annotators to work on the moment retrieval task, where a query sentence and a video were displayed to an annotator, and the annotator marked the start and end times of a moment that corresponds  ... 
arXiv:2009.00325v2 fatcat:5o5fb5hvrzg6pnizdgr3dge3xq

Multi-scale 2D Representation Learning for weakly-supervised moment retrieval [article]

Ding Li, Rui Wu, Yongqiang Tang, Zhizhong Zhang, Wensheng Zhang
2021 arXiv   pre-print
To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval.  ...  Video moment retrieval aims to search the moment most relevant to a given language query.  ...  Inspired by the success of the weakly-supervised temporal action detection, a small number of works are proposed to retrieve best-matching video moment without annotations of temporal boundaries.  ... 
arXiv:2111.02741v1 fatcat:fmvmp2k3xvcjlp3d6fahgiqkiq

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [article]

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
2020 arXiv   pre-print
We present HERO, a novel framework for large-scale video+language omni-representation learning.  ...  Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference  ...  Anne Hendricks et al. (2017b) and Gao et al. (2017) introduce the task of Single Video Moment Retrieval (SVMR), which aims at retrieving a moment from a single video via a natural language query.  ... 
arXiv:2005.00200v2 fatcat:skm6ktfgq5hpzhdsbmrajkbjcq

MTVR: Multilingual Moment Retrieval in Videos [article]

Jie Lei, Tamara L. Berg, Mohit Bansal
2021 arXiv   pre-print
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.  ...  We further propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages, via encoder parameter sharing and language neighborhood constraints.  ...  ., 2020) introduced the Video Corpus Moment Retrieval (VCMR) task: given a natural language query, a system needs to retrieve a short moment from a large video corpus.  ... 
arXiv:2108.00061v1 fatcat:kcn4quanr5hs3kukm2s3y5wziu

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [article]

Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal (+3 others)
2021 arXiv   pre-print
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.  ...  To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval  ...  For a given video and the start/end points of a moment of the video, a model must generate a description for the video moment with/without leveraging the information from the entire video.  ... 
arXiv:2106.04632v2 fatcat:zszcuqp6rjexjokioto5riwepy

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training.  ...  Video moment retrieval is to search the moment that is most relevant to the given natural language query.  ...  This motivates us to develop a weakly-supervised method for moment retrieval that needs only coarse video-level annotations for training.  ... 
doi:10.1609/aaai.v34i07.6820 fatcat:zveh5blsg5ehvapv2aes7unvye

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval [article]

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
2020 arXiv   pre-print
Further, we present several baselines and a novel Cross-modal Moment Localization (XML) network for multimodal moment retrieval tasks.  ...  We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic.  ...  (5) Single Video Moment Retrieval.  ... 
arXiv:2001.09099v2 fatcat:npokf5n7tbca7bf6a44shlnlim

Weak Supervision and Referring Attention for Temporal-Textual Association Learning [article]

Zhiyuan Fang, Shu Kong, Zhe Wang, Charless Fowlkes, Yezhou Yang
2020 arXiv   pre-print
However, training such a system in a fully supervised way inevitably demands a meticulously curated video dataset with temporal-textual annotations.  ...  queries compared to the single video, and 3) cross-video visual similarities.  ...  in the wedding" (for moment retrieval) or "a man in yellow shirt appearing in the hall in last night" (for video surveillance).  ... 
arXiv:2006.11747v2 fatcat:bpqa6chthfgjhatmsgqq5t2dym

Text-based Localization of Moments in a Video Corpus [article]

Sudipta Paul, Niluthpol Chowdhury Mithun, Amit K. Roy-Chowdhury
2021 arXiv   pre-print
This task poses a unique challenge as the system is required to perform: (i) retrieval of the relevant video where only a segment of the video corresponds with the queried sentence, and (ii) temporal localization  ...  on the proposed task of temporal localization of moments in a corpus of videos.  ...  It is more likely that a user would need to retrieve a moment from a large corpus of videos given a sentence query.  ... 
arXiv:2008.08716v2 fatcat:s3epp3qmijgsdirktv3idcu7n4