
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition [article]

Wangmeng Xiang, Chao Li, Biao Wang, Xihan Wei, Xian-Sheng Hua, Lei Zhang
2022 arXiv   pre-print
In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition.  ...  TPS shifts part of the patches with a specific mosaic pattern along the temporal dimension, thus converting a vanilla spatial self-attention operation into a spatiotemporal one at little additional cost.  ...  CNN-based methods typically use 3D convolution [37, 4, 13] or 2D CNNs with temporal modeling [39, 31, 22] to construct effective backbones for action recognition.  ... 
arXiv:2207.13259v1

Video Swin Transformer [article]

Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu
2021 arXiv   pre-print
...even with spatial-temporal factorization.  ...  These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions.  ...  Because of this property, full spatiotemporal self-attention can be well-approximated by self-attention computed locally, at a significant saving in computation and model size.  ... 
arXiv:2106.13230v1

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [article]

Farrukh Rahman, Ömer Mubarek, Zsolt Kira
2022 arXiv   pre-print
However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training.  ...  This notion has carried over to video, where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings.  ...  We thank Yen-Cheng Liu for discussion and comments on our work.  ... 
arXiv:2209.07474v1

Vision Transformers for Action Recognition: A Survey [article]

Anwaar Ulhaq, Naveed Akhtar, Ganna Pogrebna, Ajmal Mian
2022 arXiv   pre-print
Moreover, we also investigate different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition.  ...  Within the context of action transformers, we explore the techniques to encode spatio-temporal data, dimensionality reduction, frame patch and spatio-temporal cube construction, and various representation  ...  ...Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition; STAR: Sparse Transformer-based Action Recognition; Action Transformer: A Self-Attention Model for Short-Time Pose-Based  ... 
arXiv:2209.05700v1

MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [article]

Jiawei Chen, Chiu Man Ho
2021 arXiv   pre-print
In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality  ...  This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.  ...  Another persistent challenge in action recognition is to effectively and efficiently model the temporal structure with large variations and complexities.  ... 
arXiv:2108.09322v2

Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding [article]

Keval Doshi, Yasin Yilmaz
2022 arXiv   pre-print
While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction.  ...  Specifically, we advocate for a concrete formulation for zero-shot action recognition that avoids an exact overlap between the training and testing classes and also limits the intra-class variance; and  ...  with all patches through time and space self-attention.  ... 
arXiv:2203.05156v1

An Image is Worth 16x16 Words, What is a Video Worth? [article]

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
2021 arXiv   pre-print
Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video.  ...  Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame.  ...  In this section, we design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain.  ... 
arXiv:2103.13915v2

SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021 [article]

Swathikiran Sudhakaran, Adrian Bulat, Juan-Manuel Perez-Rua, Alex Falcon, Sergio Escalera, Oswald Lanz, Brais Martinez, Georgios Tzimiropoulos
2021 arXiv   pre-print
GSF is an efficient spatio-temporal feature extraction module that can be plugged into 2D CNNs for video action recognition.  ...  We design an ensemble of GSF and XViT model families with different backbones and pretraining to generate the prediction scores.  ...  XViT: Vision transformers [3] can be extended to video recognition by extending the self-attention mechanism between tokens within a frame to tokens from other frames as well.  ... 
arXiv:2110.02902v1

DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [article]

Yuxuan Liang, Pan Zhou, Roger Zimmermann, Shuicheng Yan
2022 arXiv   pre-print
While transformers have shown great potential on video recognition tasks with their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention  ...  In this paper, we propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.  ...  Acknowledgement: The authors would like to thank Quanhong Fu at Sea AI Lab for the help in improving the technical writing aspect of this paper.  ... 
arXiv:2112.04674v2

Efficient Spatialtemporal Context Modeling for Action Recognition [article]

Congqi Cao, Yue Lu, Yifan Zhang, Dongmei Jiang, Yanning Zhang
2021 arXiv   pre-print
...video for action recognition.  ...  However, directly modeling the contextual information between any two points brings huge cost in computation and memory, especially for action recognition, where there is an additional temporal dimension  ...  We design a 3D recurrent criss-cross attention module that is most suitable to model the long-range spatiotemporal contextual information for action recognition.  ...  Extensive experiments with different  ... 
arXiv:2103.11190v2

Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [article]

Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee
2020 arXiv   pre-print
...including the 1st Visual Inductive Priors (VIPriors) for data-efficient action recognition challenge.  ...  Deep-learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatiotemporal network architectures.  ...  Video recognition: For video action recognition, like image recognition, various architectures have been proposed to capture spatiotemporal features from videos.  ... 
arXiv:2008.05721v1

SpatioTemporal Focus for Skeleton-based Action Recognition [article]

Liyu Wu, Can Zhang, Yuexian Zou
2022 arXiv   pre-print
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition due to their powerful ability to model data topology.  ...  As a result, more explainable representations for different skeleton action sequences can be obtained by MCF.  ...  The attention mechanism is also widely used in action recognition; e.g., the method in [35] utilizes the self-attention mechanism to capture the global spatiotemporal features in videos.  ... 
arXiv:2203.16767v1

MAR: Masked Autoencoders for Efficient Action Recognition [article]

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang
2022 arXiv   pre-print
Standard approaches for video recognition usually operate on the full input videos, which is inefficient due to the widely present spatio-temporal redundancy in videos.  ...  Inspired by this, we propose Masked Action Recognition (MAR), which reduces the redundant computation by discarding a proportion of patches and operating only on a part of the videos.  ...  adaptive temporal kernels [33], capturing temporal difference for motion modelling [3], [34], [35], and shifting part of the channels along the temporal dimension [36], etc.  ... 
arXiv:2207.11660v1

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition [article]

Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong, Han-Seok Seo, Son Lam Phung, Xin Li, Khoa Luu
2022 arXiv   pre-print
This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition.  ...  Various 3D-CNN-based methods have been presented to tackle both the spatial and temporal dimensions in the task of video action recognition with competitive results.  ...  In this paper, we therefore address two fundamental questions for current action recognition models.  ... 
arXiv:2203.10233v1

Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Zhensheng Shi, Liangjie Cao, Cheng Guan, Haiyong Zheng, Zhaorui Gu, Zhibin Yu, Bing Zheng
2020 IEEE Access  
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition.  ...  INDEX TERMS Action recognition, video understanding, spatiotemporal representation, visual attention, 3D-CNN, residual learning.  ...  attention-enhanced spatiotemporal representation for action recognition.  ... 
doi:10.1109/access.2020.2968024