1,952 Hits in 8.8 sec

A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [article]

Yitian Yuan, Xiaohan Lan, Long Chen, Wei Liu, Xin Wang, Wenwu Zhu
2021 arXiv   pre-print
In this paper, we first take a closer look at the existing evaluation protocol, and argue that both the prevailing datasets and metrics are the devils that cause unreliable benchmarking.  ...  Although Temporal Sentence Grounding in Videos (TSGV) has achieved impressive progress over the last few years, current TSGV models tend to capture the moment annotation biases and fail to take full advantage  ...  Conclusion: In this paper, we took a closer look at the existing evaluation protocol of the temporal sentence grounding in videos (TSGV) task, and we found that both the prevailing datasets and metrics  ...
arXiv:2101.09028v2 fatcat:tlvfxoxr4nfcbetnaqc4adxs5a

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu
2022 arXiv   pre-print
In this paper, we take a closer look at existing evaluation protocols, and find that both the prevailing dataset and evaluation metrics are the devils that lead to untrustworthy benchmarking.  ...  Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years.  ...  Conclusion: In this paper, we take a closer look at mainstream benchmark datasets for temporal sentence grounding in videos and find that there is significant annotation bias, resulting in highly untrustworthy  ...
arXiv:2203.05243v1 fatcat:lkyv5znigvdedfsffmnsslxq2e

Video Description: A Survey of Methods, Datasets and Evaluation Metrics [article]

Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, Mubarak Shah
2019 arXiv   pre-print
Numerous methods, datasets and evaluation metrics have been proposed in the literature, calling for a comprehensive survey to focus research efforts in this flourishing new direction.  ...  Video description is the automatic generation of natural language sentences that describe the contents of a given video.  ...  The research was supported by ARC Discovery Grants DP160101458 and DP150102405.  ...
arXiv:1806.00186v3 fatcat:elxztcpzizhr7clugnbjvvrpte

VideoSET: Video Summary Evaluation through Text [article]

Serena Yeung, Alireza Fathi, Li Fei-Fei
2014 arXiv   pre-print
We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.  ...  Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans  ...  Acknowledgements: This research is partially supported by an ONR MURI grant, an Intel gift, and a Stanford Graduate Fellowship to S.Y.  ...
arXiv:1406.5824v1 fatcat:yq6qeq7abzhgpcwqxxtufl74qi

OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [article]

Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo
2020 arXiv   pre-print
In this paper, we propose a novel task to understand videos at the object level, named object-oriented video captioning.  ...  Thereafter, we construct a new dataset, providing consistent object-sentence pairs, to facilitate effective cross-modal learning.  ...  The boy in black tee and blue shorts walks to the right and looks at the woman drawing on the ground.  ...
arXiv:2003.03715v5 fatcat:g5trretzdjauplie7estebze2a

Tripping through time: Efficient Localization of Activities in Videos [article]

Meera Hahn, Asim Kadav, James M. Rehg, Hans Peter Graf
2020 arXiv   pre-print
In our evaluation over Charades-STA, ActivityNet Captions and the TACoS dataset, we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.  ...  Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video.  ...  During training, we take a video and a single query sentence that has a ground truth temporal alignment in the clip.  ... 
arXiv:1904.09936v5 fatcat:4bvoeildkna2xejzdsvcjo6nhm

Localizing Moments in Video with Natural Language [article]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2017 arXiv   pre-print
We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description.  ...  A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding  ...  Language grounding in video has focused on spatially grounding objects and actions in a video [20, 55], or aligning textual phrases to temporal video segments [28, 43].  ...
arXiv:1708.01641v1 fatcat:sgrv3qlhhfaujh6szkoxgwgmqa

Dense-Captioning Events in Videos [article]

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles
2017 arXiv   pre-print
We introduce the task of dense-captioning events, which involves both detecting and describing events in a video.  ...  Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping".  ...  This research was sponsored in part by grants from the Office of Naval Research (N00014-15-1-2813) and Panasonic, Inc.  ... 
arXiv:1705.00754v1 fatcat:wkph3qixdrbhllplsznnxawsqa

Leveraging Video Descriptions to Learn Video Question Answering

Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun
2017 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training.  ...  Our approach automatically harvests a large number of videos and descriptions freely available online.  ...  We also thank Shih-Han Chou, Heng Hsu, and I-Hsin Lee for their collaboration with the dataset.  ... 
doi:10.1609/aaai.v31i1.11238 fatcat:v3c6jcsavvdp3l4l3mvf67gw2e

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [article]

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
2020 arXiv   pre-print
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video.  ...  Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal  ...  Hence, the best-matched proposal provides only a coarse grounding result. In order to achieve a more precise grounding result, we look closer at the coarse grounding result.  ... 
arXiv:2001.09308v1 fatcat:z2h6kx2j4rg2xbfq4sopn6dmy4

Leveraging Video Descriptions to Learn Video Question Answering [article]

Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun
2016 arXiv   pre-print
In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training.  ...  Our approach automatically harvests a large number of videos and descriptions freely available online.  ...  We also thank Shih-Han Chou, Heng Hsu, and I-Hsin Lee for their collaboration with the dataset.  ... 
arXiv:1611.04021v2 fatcat:t3eqot35knbs7lczo2ytb6uhcq

Identity-Aware Multi-Sentence Video Description [article]

Jae Sung Park, Trevor Darrell, Anna Rohrbach
2020 arXiv   pre-print
We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips.  ...  We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities.  ...  The work of Trevor Darrell and Anna Rohrbach was in part supported by the DARPA XAI program, the Berkeley Artificial Intelligence Research (BAIR) Lab, and the Berkeley DeepDrive (BDD) Lab.  ...
arXiv:2008.09791v1 fatcat:k5sl2e7t4bbenpv7j7v6jufssq

Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning

Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo
2020 IEEE Access  
Instead of a coarse and holistic description of the entire video or a sporadic object, we aim at understanding the video at the object level, which is closer to the human experience of watching videos.  ...  Interaction: in the right, in the left, towards, in front of, following, holding, back to, looking at, back and forth, ...  ...
doi:10.1109/access.2020.3021857 fatcat:bhgzumf2yjdj3bpdxvu4sttqyy

Video Action Understanding: A Tutorial [article]

Matthew Hutchinson, Vijay Gadepally
2020 arXiv   pre-print
This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of  ...  Finding, identifying, and predicting actions are a few of the most salient tasks in video action understanding.  ...  The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air  ... 
arXiv:2010.06647v1 fatcat:hprgdsdtbfezvcbnwwxpr2n2gu

A Comprehensive Review on Recent Methods and Challenges of Video Description [article]

Alok Singh, Thoudam Doren Singh, Sivaji Bandyopadhyay
2020 arXiv   pre-print
metrics, and datasets.  ...  In this work, we report a comprehensive survey on the phases of video description approaches, the datasets for video description, evaluation metrics, open competitions for motivating research on the  ...  In these metrics, as the higher-order n-gram matching score increases, the output is assumed to be closer to the ground truth.  ...
arXiv:2011.14752v1 fatcat:rsqcvh5wfffelhq3oqtykvifo4
Showing results 1 — 15 out of 1,952 results