54,378 Hits in 3.8 sec

Long-Term Feature Banks for Detailed Video Understanding [article]

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick
2019 arXiv   pre-print
We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds  ...  Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades  ...  Long-Term Feature Bank 3D CNN input Related Work Deep networks are the dominant approach for video understanding [5, 21, 33, 39, 46-48, 50, 51, 56].  ... 
arXiv:1812.05038v2 fatcat:nwdgyfurzzhaljykxyi6dbfdge

Long-Term Feature Banks for Detailed Video Understanding

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, Ross Girshick
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.  ...  Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades  ...  Long-Term Feature Bank 3D CNN input Related Work Deep networks are the dominant approach for video understanding [5, 21, 33, 39, 46-48, 50, 51, 56].  ... 
doi:10.1109/cvpr.2019.00037 dblp:conf/cvpr/WuF0HKG19 fatcat:5vp6t547dfb6hcagxvtu5cbd2a

Context-Aware RCNN: A Baseline for Action Detection in Videos [article]

Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, Gangshan Wu
2020 arXiv   pre-print
Our approach can serve as a strong baseline for video action detection and is expected to inspire new ideas for this field. The code is available at .  ...  Video action detection approaches usually conduct actor-centric action recognition over RoI-pooled features following the standard pipeline of Faster-RCNN.  ...  This work is supported by SenseTime Research Fund for Young Scholars, the National Science Foundation of China (No. 61921006), Program for Innovative Talents and Entrepreneur in Jiangsu Province, and Collaborative  ... 
arXiv:2007.09861v1 fatcat:cntx4bbblven7jdrwhiwd7hi4y

Three Branches: Detecting Actions With Richer Features [article]

Jin Xia, Jiajun Tang, Cewu Lu
2019 arXiv   pre-print
This model seeks to fuse richer information of global video clip, short human attention and long-term human activity into a unified model.  ...  For Kinetics, we achieve 21.59% error rate.  ...  We first fine-tune the simple SlowFast networks to extract features for the long-term feature banks. Then we fine-tune the whole networks which invoke long-term features and global features.  ... 
arXiv:1908.04519v1 fatcat:27rxbrhisffnvaauz2mruccw24

Action Genome: Actions as Composition of Spatio-temporal Scene Graphs [article]

Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles
2019 arXiv   pre-print
With Action Genome, we extend an existing action recognition model by incorporating scene graphs as spatio-temporal feature banks to achieve better performance on the Charades dataset.  ...  It contains 10K videos with 0.4M objects and 1.7M visual relationships annotated.  ...  Our model is most directly related to the recent long-term feature banks [75], which accumulate features of a long video as a fixed-size representation for action recognition.  ... 
arXiv:1912.06992v1 fatcat:6iap73ap2zbi7bxdkrtvkn66wi

Associating Objects with Transformers for Video Object Segmentation [article]

Zongxin Yang, Yunchao Wei, Yi Yang
2021 arXiv   pre-print
For sufficiently modeling multi-object association, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation.  ...  In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space.  ...  A.3 Illustration of Long Short-term Attention To facilitate understanding our long-term and short-term attention modules, we illustrate their processes in Fig. 5.  ... 
arXiv:2106.02638v3 fatcat:mwhmxpp2u5dmplxyquvkin4y7i

Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection [article]

Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang
2020 arXiv   pre-print
Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from  ...  In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days  ...  Acknowledgements We would like to thank Pietro Perona, David Ross, Zhichao Lu, Ting Yu, Tanya Birch and the Wildlife Insights Team, Joe Marino, and Oisin MacAodha for their valuable insight.  ... 
arXiv:1912.03538v3 fatcat:ocabhva3azfinlwlypiplbweyu

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization [article]

Junting Pan, Siyu Chen, Zheng Shou, Jing Shao, Hongsheng Li
2020 arXiv   pre-print
Moreover, to allow utilizing more temporal contexts, we extend our framework with an Actor-Context Feature Bank for reasoning long-range high-order relations.  ...  Localizing persons and recognizing their actions from videos is a challenging task towards high-level video understanding.  ...  Actor-Context Feature Bank Inspired by the Long-term Feature Bank (LFB) [34], which creates a feature bank over a large time span to facilitate first-order actor-actor relation reasoning across a long  ... 
arXiv:2006.07976v1 fatcat:fnrglifes5bdljepuyvl477daq

Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation [article]

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
2021 arXiv   pre-print
MAMP then propagates the masks from the memory bank to subsequent frames according to our proposed motion-aware spatio-temporal matching module to handle fast motion and long-term matching scenarios.  ...  During inference, MAMP extracts high-resolution features from each frame to build a memory bank from the features as well as the predicted masks of selected past frames.  ...  However, the incremental memory bank updates are impractical when segmenting long videos due to the growing memory cost. In this work, we divide the memory into long-term and short-term memory.  ... 
arXiv:2107.12569v2 fatcat:escvlh6fsbclllgv34xtsd5paq

Symbiotic Attention with Privileged Information for Egocentric Action Recognition [article]

Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
2020 arXiv   pre-print
Finer position-aware object detection features can facilitate the understanding of the actor's interaction with the object.  ...  Egocentric video recognition is a natural testbed for diverse interaction reasoning.  ...  LFB (Wu et al. 2019a) combines Long-Term Feature Banks (detection features) with 3D CNN to improve the accuracy of object recognition.  ... 
arXiv:2002.03137v1 fatcat:3f5dehbhsfbztmzuwoz4ghtwsi

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
2020 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference  
Finer position-aware object detection features can facilitate the understanding of the actor's interaction with the object.  ...  Egocentric video recognition is a natural testbed for diverse interaction reasoning.  ...  LFB (Wu et al. 2019a) combines Long-Term Feature Banks (detection features) with 3D CNN to improve the accuracy of object recognition.  ... 
doi:10.1609/aaai.v34i07.6907 fatcat:pj2y2yurxfeoleyuxcozhmql5q

Aggregating Long-Term Context for Learning Laparoscopic and Robot-Assisted Surgical Workflows [article]

Yutong Ban, Guy Rosman, Thomas Ward, Daniel Hashimoto, Taisei Kondo, Hidekazu Iwaki, Ozanan Meireles, Daniela Rus
2021 arXiv   pre-print
Analyzing surgical workflow is crucial for surgical assistance robots to understand surgeries.  ...  We propose a new temporal network structure that leverages task-specific network representation to collect long-term sufficient statistics that are propagated by a sufficient statistics model (SSM).  ...  Sufficient Statistic Features Different choices of summarization S can make it easy for the network to learn long-term interactions.  ... 
arXiv:2009.00681v4 fatcat:5kqgfyw2cbeglk5lcrvx5a5jxu

Filter Learning from Deep Descriptors of a Fully Convolutional Siamese Network for Tracking in Videos

Hugo Chaves, Kevyn Ribeiro, André Brito, Hemerson Tacon, Marcelo Vieira, Augusto Cerqueira, Saulo Villela, Helena Maia, Darwin Concha, Helio Pedrini
2020 Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications  
Specifically, we propose a combination of the signal of descriptors in long and short term memory blocks, which represent the first and the recent appearance of the object, respectively.  ...  The filter bank is then used to compute the short term memory output. According to experiments performed in the widely used OTB dataset, our proposal improves the baseline performance.  ...  There is also no uniformity for frame rate and resolution, and there are short and long term videos.  ... 
doi:10.5220/0008957606850694 dblp:conf/visapp/ChavesRBTVCVMCP20 fatcat:jrg3ytdimrh6dpq5rfizqrynpy

Temporal Context Aggregation for Video Retrieval with Contrastive Learning [article]

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue
2020 arXiv   pre-print
In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features  ...  The current research focus on Content-Based Video Retrieval requires higher-level video representation describing the long-range semantic dependencies of relevant incidents, events, etc.  ...  In terms of the sequence models, the Long Short-Term Memory (LSTM) [20] and Gated Recurrent Unit (GRU) [8] are commonly used for video re-localization and copy detection [13, 22].  ... 
arXiv:2008.01334v2 fatcat:yyohxhiq45cipewfj3qyr43oaa

Condensed Movies: Story Based Retrieval with Contextual Embeddings [article]

Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
2020 arXiv   pre-print
Our objective in this work is long range understanding of the narrative structure of movies.  ...  It is also an order of magnitude larger than existing movie datasets in the number of movies; (ii) We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech  ...  We are grateful to Samuel Albanie for his help with feature extraction.  ... 
arXiv:2005.04208v2 fatcat:h2ib4fpmpra3jhrlz4j5whbveq
Showing results 1 — 15 out of 54,378 results