
Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [article]

Haisheng Su, Peiqin Zhuang, Yukun Li, Dongliang Wang, Weihao Gan, Wei Wu, Yu Qiao
2021 arXiv pre-print
This technical report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge on both the Supervised Learning Track and the Weakly-Supervised Learning Track  ...  In this paper, to train a supervised temporal action localizer, we adopt the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context  ...  For the weakly supervised learning track, we propose a unified network named the transferable knowledge-based Multi-Granularity Fusion Network (KT-MGFN) for WSTAL.  ...
arXiv:2107.12618v1 fatcat:c5xm6eixyjfojobuegdam3duli

A Comprehensive Review of the Video-to-Text Problem [article]

Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said
2021 arXiv pre-print
Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information.  ...  This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance.  ...  ActivityNet Captions dataset includes temporal action proposal, temporal action localization, and captions related to each action (video segment).  ... 
arXiv:2103.14785v3 fatcat:xwzziozwjbghfobtowu5bny6bu

Ego4D: Around the World in 3,000 Hours of Egocentric Video [article]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan (+73 others)
2022 arXiv pre-print
The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant.  ...  Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation  ...  Acknowledgements: We gratefully acknowledge the following colleagues for valuable discussions and support of our project: Aaron Adcock, Andrew Allen, Behrouz Behmardi, Serge Belongie, Mark Broyles, Xiao  ...
arXiv:2110.07058v3 fatcat:lgh27km63nhcdcpkvbr2qarsru

Gesture Similarity Learning and Retrieval in Large-Scale Real-world Video Collections

Mahnaz Parian-Scherb
Although advances in natural language processing methods have led to various contributions in this field, computer vision tools and methods are not prominently used to aid the researchers in  ...  Their model uses the I3D network [13] for extracting the base features and a region proposal network [67] as a sampling mechanism to localize people performing actions.  ...  Boundary-based methods: Boundary-based methods eliminate the need for a sliding window for temporal localization.  ...
doi:10.5451/unibas-ep84855 fatcat:kpkjeytdinf25omxqk6cx3ybee