Localizing Moments in Video with Natural Language
[article]
2017
arXiv
pre-print
We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. ...
Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. ...
Only comparing moments within a single video means the model must learn to differentiate between subtle differences without learning how to differentiate between broader semantic concepts (e.g., "girl" ...
arXiv:1708.01641v1
fatcat:sgrv3qlhhfaujh6szkoxgwgmqa
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
[article]
2021
arXiv
pre-print
We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations ...
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant ...
QVHIGHLIGHTS can have multiple disjoint moments paired with a single query (on average 1.8 moments per query in a video), while all the moment retrieval datasets can only have a single moment. ...
arXiv:2107.09609v2
fatcat:wroc3zg6ufbtzcird72vyq6dfa
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
[article]
2020
arXiv
pre-print
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training. ...
Video moment retrieval is to search the moment that is most relevant to the given natural language query. ...
This motivates us to develop a weakly-supervised method for moment retrieval that needs only coarse video-level annotations for training. ...
arXiv:1911.08199v3
fatcat:7vwjsnr6cza7fj74rifxd22sdm
AssistSR: Task-oriented Question-driven Video Segment Retrieval
[article]
2022
arXiv
pre-print
Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. ...
In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). ...
only have one single moment or one single video. ...
arXiv:2111.15050v3
fatcat:mhqwl54piffp5j4v5wnnqktafq
Weakly Supervised Video Moment Retrieval From Text Queries
[article]
2019
arXiv
pre-print
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. ...
There have been a few recent methods proposed in text to video moment retrieval using natural language queries, but requiring full supervision during training. ...
This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc. ...
arXiv:1904.03282v2
fatcat:5qithwolavfwpofawe232b6pzi
Weakly Supervised Video Moment Retrieval From Text Queries
2019
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. ...
There have been a few recent methods proposed in text to video moment retrieval using natural language queries, but requiring full supervision during training. ...
This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc. ...
doi:10.1109/cvpr.2019.01186
dblp:conf/cvpr/MithunPR19
fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm
Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
[article]
2020
arXiv
pre-print
Query-based moment retrieval is the problem of localising a specific clip from an untrimmed video according to a query sentence. ...
Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has significant ...
We asked the annotators to work on the moment retrieval task, where a query sentence and a video were displayed to an annotator, and the annotator marked the start and end times of a moment that corresponds ...
arXiv:2009.00325v2
fatcat:5o5fb5hvrzg6pnizdgr3dge3xq
Multi-scale 2D Representation Learning for weakly-supervised moment retrieval
[article]
2021
arXiv
pre-print
To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. ...
Video moment retrieval aims to search the moment most relevant to a given language query. ...
Inspired by the success of the weakly-supervised temporal action detection, a small number of works are proposed to retrieve best-matching video moment without annotations of temporal boundaries. ...
arXiv:2111.02741v1
fatcat:fmvmp2k3xvcjlp3d6fahgiqkiq
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[article]
2020
arXiv
pre-print
We present HERO, a novel framework for large-scale video+language omni-representation learning. ...
Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference ...
Anne Hendricks et al. (2017b) and Gao et al. (2017) introduce the task of Single Video Moment Retrieval (SVMR), which aims at retrieving a moment from a single video via a natural language query. ...
arXiv:2005.00200v2
fatcat:skm6ktfgq5hpzhdsbmrajkbjcq
MTVR: Multilingual Moment Retrieval in Videos
[article]
2021
arXiv
pre-print
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. ...
We further propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages, via encoder parameter sharing and language neighborhood constraints. ...
., 2020) introduced the Video Corpus Moment Retrieval (VCMR) task: given a natural language query, a system needs to retrieve a short moment from a large video corpus. ...
arXiv:2108.00061v1
fatcat:kcn4quanr5hs3kukm2s3y5wziu
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
[article]
2021
arXiv
pre-print
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. ...
To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval ...
For a given video and the start/end points of a moment of the video, a model must generate a description for the video moment with/without leveraging the information from the entire video. ...
arXiv:2106.04632v2
fatcat:zszcuqp6rjexjokioto5riwepy
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
2020
Proceedings of the AAAI Conference on Artificial Intelligence
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training. ...
Video moment retrieval is to search the moment that is most relevant to the given natural language query. ...
This motivates us to develop a weakly-supervised method for moment retrieval that needs only coarse video-level annotations for training. ...
doi:10.1609/aaai.v34i07.6820
fatcat:zveh5blsg5ehvapv2aes7unvye
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
[article]
2020
arXiv
pre-print
Further, we present several baselines and a novel Cross-modal Moment Localization (XML) network for multimodal moment retrieval tasks. ...
We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. ...
(5)
Single Video Moment Retrieval. ...
arXiv:2001.09099v2
fatcat:npokf5n7tbca7bf6a44shlnlim
Weak Supervision and Referring Attention for Temporal-Textual Association Learning
[article]
2020
arXiv
pre-print
However, training such a system in a fully supervised way inevitably demands a meticulously curated video dataset with temporal-textual annotations. ...
queries compared to the single video, and 3) cross-video visual similarities. ...
in the wedding" (for moment retrieval) or "a man in yellow shirt appearing in the hall in last night" (for video surveillance). ...
arXiv:2006.11747v2
fatcat:bpqa6chthfgjhatmsgqq5t2dym
Text-based Localization of Moments in a Video Corpus
[article]
2021
arXiv
pre-print
This task poses a unique challenge as the system is required to perform: (i) retrieval of the relevant video where only a segment of the video corresponds with the queried sentence, and (ii) temporal localization ...
on the proposed task of temporal localization of moments in a corpus of videos. ...
It is more likely that a user would need to retrieve a moment from a large corpus of videos given a sentence query. ...
arXiv:2008.08716v2
fatcat:s3epp3qmijgsdirktv3idcu7n4