31,087 Hits in 6.8 sec

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification [article]

Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen
2017 arXiv   pre-print
We investigate the potential of purely attention-based local feature integration.  ...  Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture  ...  Conclusion: To explore the potential of pure attention networks for video classification, a new architecture based on attention clusters with a shifting operation is proposed to integrate local feature  ...
arXiv:1711.09550v1 fatcat:ohkbb5mbnvgilhd2jv5lmbdpuq

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
We investigate the potential of purely attention-based local feature integration.  ...  Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture  ...  Conclusion: To explore the potential of pure attention networks for video classification, a new architecture based on attention clusters with a shifting operation is proposed to integrate local feature  ...
doi:10.1109/cvpr.2018.00817 dblp:conf/cvpr/LongGM0LW18 fatcat:taiozb4o25b2fbxbz3sjaekzzm
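
The attention-cluster idea summarized in the two records above lends itself to a compact sketch. Below is a minimal PyTorch reading of it, assuming an attention cluster is a set of parallel attention units, each producing softmax weights over the local features, with the shifting operation interpreted as a learnable per-unit scale and shift followed by L2 normalization; the class name `AttentionCluster` and the parameters `alpha`/`beta` are illustrative, not taken from the authors' code.

```python
# Minimal sketch of an attention cluster with a shifting operation (one
# reading of the abstract above, not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCluster(nn.Module):
    def __init__(self, feat_dim: int, n_units: int):
        super().__init__()
        # One scoring head per attention unit in the cluster.
        self.scorers = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(n_units)])
        # Per-unit scale (alpha) and shift (beta) for the shifting operation
        # (assumed parameterization).
        self.alpha = nn.Parameter(torch.ones(n_units, 1))
        self.beta = nn.Parameter(torch.zeros(n_units, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) local features, e.g. per-frame CNN features.
        outputs = []
        for k, scorer in enumerate(self.scorers):
            w = F.softmax(scorer(x), dim=1)       # (batch, T, 1) attention weights
            v = (w * x).sum(dim=1)                # (batch, feat_dim) weighted sum
            v = self.alpha[k] * v + self.beta[k]  # shifting operation: scale + shift
            v = F.normalize(v, dim=-1)            # followed by L2 normalization
            outputs.append(v)
        return torch.cat(outputs, dim=-1)         # (batch, n_units * feat_dim)
```

In the setting the abstract describes, one such cluster would be applied per modality (e.g. RGB, flow, audio) and the concatenated outputs fed to a classifier.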

ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection [article]

Zuheng Ming, Zitong Yu, Musab Al-Ghadi, Muriel Visani, Muhammad Muzzamil Luqman, Jean-Christophe Burie
2022 arXiv   pre-print
Inspired by ViT, we propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which can not only focus on local details with short attention within a frame  ...  Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary classification task without considering the context.  ...  The integrated CNNs are forced to capture local spatial structure, which allows dropping the positional encoding that is crucial for pure transformers.  ...
arXiv:2203.01562v2 fatcat:r7h7zqk3nnearfxqu4dvqyrjfu
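
The short/long-range split described above can be sketched as two attention passes over video patch tokens: one restricted to patches within a frame, one across frames at each spatial location. This is a hedged sketch of that idea only; the module and its name are assumptions, not the paper's architecture.

```python
# Hedged sketch of a short/long-range attention split over video patch
# tokens (illustrative only; names are not from the paper).
import torch
import torch.nn as nn

class ShortLongAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.short = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.long = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) patch tokens of a video clip.
        b, t, n, d = x.shape
        # Short-range: attend among patches *within* each frame (local details).
        s = x.reshape(b * t, n, d)
        s, _ = self.short(s, s, s)
        s = s.reshape(b, t, n, d)
        # Long-range: attend *across* frames at each spatial location.
        l = s.permute(0, 2, 1, 3).reshape(b * n, t, d)
        l, _ = self.long(l, l, l)
        return l.reshape(b, n, t, d).permute(0, 2, 1, 3)
```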

Shifted Chunk Transformer for Spatio-Temporal Representational Learning [article]

Xuefan Zha, Wentao Zhu, Tingxun Lv, Sen Yang, Ji Liu
2021 arXiv   pre-print
Leveraging the recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip.  ...  However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch.  ...  No external funding was received for this work. Moreover, we would like to thank Hang Shang for insightful discussions.  ...
arXiv:2108.11575v5 fatcat:55xjmhdphzfy3dux4iu2lxwtri

Recent Advances in Vision Transformer: A Survey for Different Domains [article]

Khawar Islam
2022 arXiv   pre-print
In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism.  ...  Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).  ...  [45] proposed a pure transformer-based Re-ID method to obtain rich features and extract in-depth video representations.  ...
arXiv:2203.01536v3 fatcat:eynby2wj6fgcpo7a5sgur74efu

Transformers Meet Visual Learning Understanding: A Comprehensive Review [article]

Yuting Yang, Licheng Jiao, Xu Liu, Fang Liu, Shuyuan Yang, Zhixi Feng, Xu Tang
2022 arXiv   pre-print
The latter contains object tracking and video classification. It is significant for comparing different models' performance on various tasks across several public benchmark data sets.  ...  The dynamic attention mechanism and global modeling ability make the Transformer show strong feature learning ability. In recent years, the Transformer has become comparable to CNN-based methods in computer vision.  ...  Transformer-based object tracking and video classification are reviewed for video tasks.  ...
arXiv:2203.12944v1 fatcat:h2kgxfnqqvcbfelvpnteqpytcu

Attention mechanisms and deep learning for machine vision: A survey of the state of the art [article]

Abdul Mueed Hafiz, Shabir Ahmad Parah, Rouf Ul Alam Bhat
2021 arXiv   pre-print
Subsequently, the major categories of the intersection of attention mechanisms and deep learning for machine vision (MV) are discussed.  ...  However, pure attention-based models/architectures like transformers require huge amounts of data, long training times, and large computational resources.  ...  Traditionally, CNN-based techniques for video classification usually performed 3D spatio-temporal manipulation on relatively small intervals for video understanding.  ...
arXiv:2106.07550v1 fatcat:fzx6d6bwhfawhcbuiebdymmaei

Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling [article]

Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Yuanjie Shao, Nong Sang
2021 arXiv   pre-print
Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances using only video-level labels, based on Multi-Instance Learning (MIL).  ...  The Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to recognize and localize the temporal starts and ends of action instances in an untrimmed video with only video-level label supervision.  ...  ViViT is a pure Transformer-based model for action recognition.  ...
arXiv:2106.11811v1 fatcat:gdtj4apmwzgivn2bk3gkf25lea
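
For readers unfamiliar with the MIL setup this abstract relies on, here is a generic sketch (not LGBM-Net itself): per-snippet class activations are aggregated by top-k mean pooling into video-level logits that can be trained with video-level labels alone. The name `MILHead` and the choice of top-k pooling are assumptions for illustration.

```python
# Generic MIL head for video-level supervision (assumed design, not the
# paper's network): snippet scores -> top-k mean pooling -> video logits.
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int, k: int = 8):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_classes)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) snippet features of an untrimmed video.
        cas = self.classifier(feats)                       # (batch, T, n_classes)
        topk = cas.topk(min(self.k, cas.shape[1]), dim=1).values
        return topk.mean(dim=1)                            # (batch, n_classes) video-level logits

# Training uses only video-level labels, e.g.:
# loss = nn.BCEWithLogitsLoss()(MILHead(2048, 20)(snippet_feats), video_labels)
```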

Learning Tracking Representations via Dual-Branch Fully Transformer Networks [article]

Fei Xie, Chunyu Wang, Guangting Wang, Wankou Yang, Wenjun Zeng
2021 arXiv   pre-print
Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with the others within an attention window.  ...  We present a Siamese-like Dual-branch network based solely on Transformers for tracking.  ...  Acknowledgment We would like to thank Chenyan Wu for his advice. This work was supported by NSFC (No.61773117 and No.62006041).  ...
arXiv:2112.02571v1 fatcat:ifqryr6e45dkvlijmcjc7d6e5i
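
The matching-based feature extraction described above can be approximated as follows: template and search images are split into non-overlapping patches, embedded, and jointly attended so each patch feature encodes its matching results with the others. A minimal sketch, assuming full attention over the concatenated token set rather than the paper's windowed variant; all names are hypothetical.

```python
# Sketch of dual-branch, matching-based feature extraction for tracking
# (assumed simplification: full joint attention instead of attention windows).
import torch
import torch.nn as nn

def patchify(img: torch.Tensor, p: int) -> torch.Tensor:
    # img: (batch, C, H, W) -> (batch, num_patches, C*p*p) non-overlapping patches
    return nn.functional.unfold(img, kernel_size=p, stride=p).transpose(1, 2)

class DualBranchMatcher(nn.Module):
    def __init__(self, patch: int, in_ch: int, dim: int, heads: int = 4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(in_ch * patch * patch, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        z = self.embed(patchify(template, self.patch))  # template tokens
        x = self.embed(patchify(search, self.patch))    # search tokens
        # Each token attends over the concatenated token set, so its output
        # feature reflects matches within its own image and across branches.
        tokens = torch.cat([z, x], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return out[:, :z.shape[1]], out[:, z.shape[1]:]
```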

Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation [article]

Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan
2022 arXiv   pre-print
Action classification has made great progress, but segmenting and recognizing actions in long untrimmed videos remains a challenging problem.  ...  In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating the U-Net architecture.  ...  Although approaches based on various architectures have been proposed that greatly improve the accuracy of video classification, their performance is limited on the action segmentation task for untrimmed  ...
arXiv:2205.13425v1 fatcat:jlms4ohgmvbeppwjyggw65ps2i

ActionFormer: Localizing Moments of Actions with Transformers [article]

Chenlin Zhang, Jianxin Wu, Yin Li
2022 arXiv   pre-print
Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding.  ...  Inspired by this success, we investigate the application of Transformer networks to temporal action localization in videos.  ...  Specifically, our model, dubbed ActionFormer, integrates local self-attention to extract a feature pyramid from an input video.  ...
arXiv:2202.07925v1 fatcat:w2l5rfa74fglzebax3qukf5x3i
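
The two ingredients the abstract names, local self-attention and a temporal feature pyramid, can be sketched as windowed 1D attention blocks with stride-2 pooling between levels. This is an assumed minimal rendering, not the released ActionFormer code.

```python
# Sketch: windowed temporal self-attention plus downsampling between levels,
# yielding a temporal feature pyramid (assumed layer choices, not the paper's).
import torch
import torch.nn as nn

class LocalTemporalAttention(nn.Module):
    def __init__(self, dim: int, window: int, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim); T assumed divisible by window for brevity.
        b, t, d = x.shape
        w = x.reshape(b * (t // self.window), self.window, d)
        w, _ = self.attn(w, w, w)       # attention restricted to each window
        return w.reshape(b, t, d)

def feature_pyramid(x: torch.Tensor, blocks: nn.ModuleList):
    # Apply a local-attention block, then stride-2 max-pooling, at each level.
    pyramid = []
    for block in blocks:
        x = block(x)
        pyramid.append(x)
        x = nn.functional.max_pool1d(x.transpose(1, 2), 2).transpose(1, 2)
    return pyramid
```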

Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [article]

Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang
2020 arXiv   pre-print
Crucially, it is extremely efficient, factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (a 1D channel vector for 'what' and 2D spatial tensors for 'where'), followed  ...  Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time.  ...  In this paper, we focus on the ResNet-50-based TSM [29] as the main instantiation for integration with W3.  ...
arXiv:2004.01278v1 fatcat:wm4rss3czjddzhio4hc3ad4oym
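
One plausible reading of the factorization described above ('what' as a 1D channel vector, 'where' as 2D spatial maps, plus a temporal 'when' weighting) is sketched below; the module name and layer choices are assumptions, not the authors' implementation.

```python
# Illustrative 'what'/'where'/'when' factorized attention over a video
# feature map (one reading of the abstract, not the paper's code).
import torch
import torch.nn as nn

class FactorizedVideoAttention(nn.Module):
    def __init__(self, channels: int, frames: int):
        super().__init__()
        self.what = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.where = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.when = nn.Sequential(nn.Linear(frames, frames), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, C, H, W) video feature map.
        b, t, c, h, w = x.shape
        what = self.what(x.mean(dim=(3, 4)))            # (b, T, C) channel attention
        x = x * what[..., None, None]
        where = self.where(x.reshape(b * t, c, h, w))   # (b*T, 1, H, W) spatial map
        x = x * where.reshape(b, t, 1, h, w)
        when = self.when(x.mean(dim=(2, 3, 4)))         # (b, T) temporal weights
        return x * when[:, :, None, None, None]
```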

MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [article]

Jiawei Chen, Chiu Man Ho
2021 arXiv   pre-print
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.  ...  In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer  ...  Furthermore, pure transformer models have also achieved competitive performance for vision tasks, e.g., image classification [14, 43], object detection [43, 83] and video classification [2, 6].  ...
arXiv:2108.09322v2 fatcat:rasr4vrvijdg5crzlhy6ntmcnm
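
A generic form of cross-modal attention, in the spirit of (but not identical to) the three variants the abstract mentions, can be written as one modality's tokens querying another's; the class below is a hedged sketch with assumed names.

```python
# One plausible form of cross-modal attention (a sketch, not one of the
# paper's three specific variants): tokens of modality A query modality B,
# e.g. so motion cues can refine appearance features.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (batch, Nq, dim) tokens of modality A (e.g. I-frame patches)
        # context: (batch, Nc, dim) tokens of modality B (e.g. motion vectors)
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)   # residual connection + layer norm
```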

Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition [article]

Dongliang He, Fu Li, Qijie Zhao, Xiang Long, Yi Fu, Shilei Wen
2018 arXiv   pre-print
Xception network (iTXN) for video understanding.  ...  (e.g., CNN+RNN) have been proposed in existing state-of-the-art work for this task, yet video modelling is far from being well solved.  ...  NetVLAD [13], Action-VLAD [14] and Attention Clusters [15] were recently proposed to integrate local features for action recognition, and good results are achieved by these encoding methods.  ...
arXiv:1806.10319v1 fatcat:emgjryszvfbw7nfduxdbr36aaq

Attention Architectures for Machine Vision and Mobile Robots [chapter]

Lucas Paletta, Erich Rome, Hilary Buxton
2005 Neurobiology of Attention  
In robotic systems, we understand attention embedded in the context of optimizing sensorimotor behavior and multisensor-based active perception.  ...  We address successful methodologies on saliency and feature selection, describe attentive systems with respect to object and scene recognition, and review saccadic interpretation under decision processes  ...  The underlying model is based on the Feature Integration Theory from Treisman and Gelade (1980) .  ... 
doi:10.1016/b978-012375731-9/50109-9 fatcat:u33zz74qjnbkfeagryxffx7lve
Showing results 1 — 15 out of 31,087 results