
Action Unit Memory Network for Weakly Supervised Temporal Action Localization [article]

Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, Yongdong Zhang
2021 arXiv   pre-print
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training.  ...  In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which can mitigate the above two challenges by learning an action unit memory bank.  ...  During testing, the goal of temporal action localization is to generate a set of action proposals {(c, s, e, q)} for each video, where c and q denote the predicted category and the confidence score  ... 
arXiv:2104.14135v1 fatcat:iihw3z26xrghzomvtuvgpmtkcm
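
The snippet above describes the localization output as a set of tuples {(c, s, e, q)} per video, where c is the predicted category and q the confidence score. A minimal Python sketch of that format, assuming s and e are the proposal's start and end times (the snippet does not spell these out):

    from dataclasses import dataclass

    @dataclass
    class ActionProposal:
        c: str      # predicted action category
        s: float    # assumed: proposal start time in seconds
        e: float    # assumed: proposal end time in seconds
        q: float    # confidence score of the prediction

    # A localization model would emit one such list per untrimmed video.
    proposals = [
        ActionProposal(c="HighJump", s=12.4, e=18.9, q=0.87),
        ActionProposal(c="HighJump", s=40.1, e=44.0, q=0.52),
    ]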

Action Shuffling for Weakly Supervised Temporal Localization [article]

Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Xinchu Shi
2021 arXiv   pre-print
The intra-action shuffling branch lays out a self-supervised order prediction task to augment the video representation with inner-video relevance, whereas the inter-action shuffling branch imposes a reorganizing  ...  This paper analyzes the order-sensitive and location-insensitive properties of actions, and embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance  ...  CONCLUSION In this paper, we have proposed a novel self-augmented framework, namely ActShufNet, for action localization in untrimmed videos with video-level weak supervision.  ... 
arXiv:2105.04208v1 fatcat:bprfv7h42rbk3h5l5k3xdj6o2q
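
The intra-action shuffling branch above is described as a self-supervised order prediction task. A hedged PyTorch sketch of the general clip-order-prediction idea, not the authors' ActShufNet code (the module, feature size, and clip count are illustrative): shuffle clip-level features and train a small head to recover each clip's original position.

    import torch
    import torch.nn as nn

    class OrderPredictionHead(nn.Module):
        def __init__(self, feat_dim=1024, num_clips=4):
            super().__init__()
            self.num_clips = num_clips
            # One classification over num_clips positions for every clip slot.
            self.fc = nn.Linear(feat_dim * num_clips, num_clips * num_clips)

        def forward(self, clip_feats):               # clip_feats: (B, num_clips, feat_dim)
            logits = self.fc(clip_feats.flatten(1))
            return logits.view(clip_feats.size(0), self.num_clips, self.num_clips)

    # Self-supervision: shuffle clips of the same video and recover the permutation.
    feats = torch.randn(8, 4, 1024)                  # toy clip-level features
    perm = torch.stack([torch.randperm(4) for _ in range(8)])
    shuffled = torch.gather(feats, 1, perm.unsqueeze(-1).expand(-1, -1, 1024))
    logits = OrderPredictionHead()(shuffled)         # (8, 4, 4)
    loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), perm.flatten())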

Relaxed Transformer Decoders for Direct Action Proposal Generation [article]

Jing Tan, Jiaqi Tang, Limin Wang, Gangshan Wu
2021 arXiv   pre-print
Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest.  ...  Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness.  ...  However, these action recognition methods cannot be directly applied to realistic video analysis because these web videos are untrimmed in nature.  ... 
arXiv:2102.01894v3 fatcat:lxzztsicyfgmtnd5a7rksukxsu
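
The abstract above mentions a three-branch head that improves proposal confidence by explicitly predicting completeness. A hedged sketch of such a head; only the completeness branch is named in the snippet, so the classification and boundary branches, and the way the two scores are fused, are assumptions here:

    import torch
    import torch.nn as nn

    class ThreeBranchHead(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.cls = nn.Linear(dim, 2)              # assumed: action vs. background
            self.boundary = nn.Linear(dim, 2)         # assumed: (start, end) regression
            self.completeness = nn.Linear(dim, 1)     # completeness score per proposal

        def forward(self, x):                         # x: (num_proposals, dim)
            action_prob = self.cls(x).softmax(dim=-1)[:, :1]
            confidence = action_prob * self.completeness(x).sigmoid()  # fused estimate
            return self.boundary(x).sigmoid(), confidence

    head = ThreeBranchHead()
    boundaries, confidence = head(torch.randn(32, 256))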

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [article]

Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan Zhou
2020 arXiv   pre-print
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.  ...  In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.  ...  ACKNOWLEDGMENTS This work was supported in part by the National Natural Science Foundation of China (No. 61972448), National Natural Science Foundation of China (No. 61902347), and Zhejiang Provincial  ... 
arXiv:2008.02448v1 fatcat:lohkhfoiufd2ngc2ak737ha75y

Weakly-Supervised Action Detection Guided by Audio Narration [article]

Keren Ye, Adriana Kovashka
2022 arXiv   pre-print
For example, EPIC Kitchens is the largest dataset in first-person (egocentric) vision, yet it still relies on crowdsourced information to refine the action boundaries to provide instance-level action annotations  ...  We explored how to eliminate the expensive annotations in video detection data which provide refined boundaries.  ...  The experiments provide insights for weakly supervised action detection methods in noisy untrimmed videos.  ... 
arXiv:2205.05895v1 fatcat:rc3fqyzzjnc5re5p2sdey6ulfy

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network [article]

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 arXiv   pre-print
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training.  ...  refinement.  ...  Temporal Action Detection: Temporal Action Detection aims at identifying the temporal boundary as well as the category for each action instance in untrimmed videos.  ... 
arXiv:1911.08199v3 fatcat:7vwjsnr6cza7fj74rifxd22sdm

ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation

Khoa Vo, Kashu Yamazaki, Sang Truong, Minh-Triet Tran, Akihiro Sugimoto, Ngan Le
2021 IEEE Access  
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many tasks of video analysis and understanding  ...  the untrimmed videos to extract video visual representation.  ...  TEMPORAL ACTION PROPOSAL GENERATION (TAPG) TAPG aims to propose temporal intervals that may contain an action instance with their temporal boundaries and confidence in untrimmed videos.  ... 
doi:10.1109/access.2021.3110973 fatcat:q4gzd4kccbde3lw4usgyab5j6y

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2020)  
In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training.  ...  refinement.  ...  Temporal Action Detection: Temporal Action Detection aims at identifying the temporal boundary as well as the category for each action instance in untrimmed videos.  ... 
doi:10.1609/aaai.v34i07.6820 fatcat:zveh5blsg5ehvapv2aes7unvye

A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community  ...  Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities (i.e., text and video).  ...  The action selection is depicted by a switch over the interface in the tree-structured policy. The alignment network will predict a confidence score to determine when to stop.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu

Weakly Supervised Temporal Adjacent Network for Language Grounding [article]

Yuechen Wang, Jiajun Deng, Wengang Zhou, Houqiang Li
2021 arXiv   pre-print
In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given to an untrimmed video without temporal boundary labels.  ...  Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage.  ...  of actions in untrimmed videos and localize the start and end frames of detected actions.  ... 
arXiv:2106.16136v1 fatcat:i7nwetztpraf5bjwxfrn2gpa2a

Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [article]

Daizong Liu, Xiaoye Qu, Pan Zhou
2021 arXiv   pre-print
A key solution to temporal sentence grounding (TSG) lies in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description.  ...  In this paper, we propose an Iterative Alignment Network (IA-Net) for the TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding.  ...  This work was supported in part by the National Natural Science Foundation of China under grant No. 61972448.  ... 
arXiv:2109.06400v1 fatcat:fiprcfnszzcjvfdup6lwaizlvq

Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [article]

Jie Wu, Wei Zhang, Guanbin Li, Wenhao Wu, Xiao Tan, Yingying Li, Errui Ding, Liang Lin
2021 arXiv   pre-print
It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework.  ...  A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches.  ...  Mutually-guided Progressive Refinement Multiple Instance Learning.  ... 
arXiv:2108.03825v1 fatcat:q4wkot3ywrc3pogbwuii5bqqzm
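
The snippet above describes dual-path mutual guidance applied recurrently, with each branch guiding its counterpart. A hedged sketch of that loop in PyTorch, with placeholder linear branches standing in for the paper's actual modules: each branch's detached predictions serve as pseudo labels for the other branch at the next refinement step.

    import torch
    import torch.nn as nn

    branch_a = nn.Linear(512, 1)       # placeholder for one detection branch
    branch_b = nn.Linear(512, 1)       # placeholder for the complementary branch
    optimizer = torch.optim.Adam(list(branch_a.parameters()) + list(branch_b.parameters()))
    bce = nn.BCEWithLogitsLoss()

    feats = torch.randn(64, 512)       # toy snippet-level features
    for step in range(3):              # recurrent refinement steps
        pseudo_a = branch_a(feats).sigmoid().detach()   # guidance produced by A
        pseudo_b = branch_b(feats).sigmoid().detach()   # guidance produced by B
        # Each branch is refined against its counterpart's current predictions.
        loss = bce(branch_a(feats), pseudo_b) + bce(branch_b(feats), pseudo_a)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()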

Graph Convolutional Networks for Temporal Action Localization [article]

Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan
2019 arXiv   pre-print
However, the relations between proposals actually play an important role in action localization, since a meaningful action always consists of multiple proposals in a video.  ...  Here, we use two types of relations, one for capturing the context information for each proposal and the other one for characterizing the correlations between distinct actions.  ...  ActivityNet [4] is another popular benchmark for action localization on untrimmed videos.  ... 
arXiv:1909.03252v1 fatcat:h3ovcv2dobcufn6lmmcbz3mgfm
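
The entry above exploits relations between proposals via graph convolutions. A hedged sketch of the general idea, not the paper's exact formulation: connect proposals by temporal overlap and apply one normalized graph-convolution step so each proposal aggregates context from related proposals.

    import torch

    def temporal_iou(p, q):                          # p, q: (start, end) in seconds
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    proposals = [(1.0, 4.0), (2.5, 6.0), (10.0, 12.0)]
    feats = torch.randn(len(proposals), 256)         # one feature vector per proposal

    adj = torch.tensor([[temporal_iou(p, q) for q in proposals] for p in proposals])
    adj = adj + torch.eye(len(proposals))            # add self-loops
    adj = adj / adj.sum(dim=1, keepdim=True)         # row-normalize
    transform = torch.nn.Linear(256, 256)
    refined = torch.relu(transform(adj @ feats))     # context-aware proposal features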

Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals [article]

Lijun Yu, Yijun Qian, Wenhe Liu, Alexander G. Hauptmann
2022 arXiv   pre-print
Therefore, they failed to deal with the multi-scale multi-instance cases in real-world unconstrained video streams, which are untrimmed and have large field-of-views.  ...  To overcome these issues, we propose Argus++, a robust real-time activity detection system for analyzing unconstrained video streams.  ...  This would result in a large amount of false alarms unless we deduplicate them.  ... 
arXiv:2201.05290v1 fatcat:n6xmnelt7nbs5cnefcgjrik4cm
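
The last snippet above notes that overlapping cube proposals would produce many false alarms unless deduplicated. A hedged sketch of one standard way to do this (greedy, NMS-style suppression by temporal overlap; the threshold and overlap measure are illustrative, not necessarily what Argus++ uses):

    def deduplicate(proposals, iou_threshold=0.5):
        """proposals: list of (start, end, score); keep the highest-scoring
        proposal among any heavily overlapping group."""
        def iou(a, b):
            inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
            union = max(a[1], b[1]) - min(a[0], b[0])
            return inter / union if union > 0 else 0.0

        kept = []
        for p in sorted(proposals, key=lambda x: x[2], reverse=True):
            if all(iou(p, k) < iou_threshold for k in kept):
                kept.append(p)
        return kept

    print(deduplicate([(0.0, 5.0, 0.9), (0.5, 5.5, 0.8), (10.0, 12.0, 0.7)]))
    # -> [(0.0, 5.0, 0.9), (10.0, 12.0, 0.7)]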

Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [article]

Daizong Liu, Xiang Fang, Wei Hu, Pan Zhou
2022 arXiv   pre-print
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.  ...  features in each frame for filtering out the redundant background contents.  ...  In contrast, the two-stage methods first generate action proposals, then refine and classify confident proposals.  ... 
arXiv:2203.02966v1 fatcat:ith2bpxlgbhlpjy4bky2dnzbqa
Showing results 1 — 15 out of 80 results