Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
[article]
2017
arXiv
pre-print
We investigate the potential of purely attention-based local feature integration. ...
Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture ...
Conclusion: To explore the potential of pure attention networks for video classification, a new architecture based on attention clusters with a shifting operation is proposed to integrate local feature ...
arXiv:1711.09550v1
fatcat:ohkbb5mbnvgilhd2jv5lmbdpuq
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
2018
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
We investigate the potential of purely attention-based local feature integration. ...
Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture ...
Conclusion: To explore the potential of pure attention networks for video classification, a new architecture based on attention clusters with a shifting operation is proposed to integrate local feature ...
doi:10.1109/cvpr.2018.00817
dblp:conf/cvpr/LongGM0LW18
fatcat:taiozb4o25b2fbxbz3sjaekzzm
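The shifting operation described in the two records above lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch re-implementation based only on the abstract snippets: the per-unit scoring layer, the scale/offset parameters, and all dimensions are assumptions rather than the authors' exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionCluster(nn.Module):
        # Sketch: a cluster of attention units pools local features, and a
        # shifting operation (learnable scale alpha and offset beta, then
        # L2 normalization scaled by 1/sqrt(n_units)) keeps the units
        # diverse. Layer shapes are assumptions, not the paper's design.
        def __init__(self, feat_dim, n_units):
            super().__init__()
            self.n_units = n_units
            self.score = nn.Linear(feat_dim, n_units)  # one weight per unit
            self.alpha = nn.Parameter(torch.ones(n_units, 1))
            self.beta = nn.Parameter(torch.zeros(n_units, 1))

        def forward(self, x):                      # x: (batch, n_local, dim)
            attn = self.score(x).softmax(dim=1)    # attend over local feats
            agg = torch.einsum('blu,bld->bud', attn, x)  # one sum per unit
            shifted = self.alpha * agg + self.beta       # shifting operation
            shifted = F.normalize(shifted, dim=-1) / self.n_units ** 0.5
            return shifted.flatten(1)              # concatenate unit outputs

    # Example: integrate 32 frame-level features of dim 512 with 8 units.
    pooled = AttentionCluster(512, 8)(torch.randn(2, 32, 512))  # (2, 4096)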
ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection
[article]
2022
arXiv
pre-print
Inspired by ViT, we propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention which can not only focus on local details with short attention within a frame ...
Many works based on Convolution Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary classification task without considering the context. ...
The integrated CNNs are forced to capture the local spatial structure, which allows dropping the positional encoding that is crucial for pure transformers. ...
arXiv:2203.01562v2
fatcat:r7h7zqk3nnearfxqu4dvqyrjfu
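A generic way to realize the short/long-range split sketched in this abstract is divided space-time attention: "short" attention over patches within each frame and "long" attention over the same patch position across frames. The sketch below shows only that generic factorization; the multi-scale windows and CNN patch embedding of ViTransPAD itself are not reproduced, and all sizes are assumptions.

    import torch
    import torch.nn as nn

    class ShortLongAttention(nn.Module):
        # Generic divided space-time attention, illustrating the short/long
        # idea; not ViTransPAD's exact multi-scale design.
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                # x: (batch, frames, patches, dim)
            b, t, p, d = x.shape
            s = x.reshape(b * t, p, d)       # short range: within each frame
            s, _ = self.spatial(s, s, s)
            l = s.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
            l, _ = self.temporal(l, l, l)    # long range: across frames
            return l.reshape(b, p, t, d).permute(0, 2, 1, 3)

    out = ShortLongAttention()(torch.randn(2, 8, 49, 128))  # (2, 8, 49, 128)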
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
[article]
2021
arXiv
pre-print
Leveraging the recent efficient Transformer design in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. ...
However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch. ...
No external funding was received for this work. Moreover, we would like to thank Hang Shang for insightful discussions. ...
arXiv:2108.11575v5
fatcat:55xjmhdphzfy3dux4iu2lxwtri
Recent Advances in Vision Transformer: A Survey for Different Domains
[article]
2022
arXiv
pre-print
In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. ...
Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks, compared to Convolutional Neural Networks (CNNs). ...
[45] proposed a pure transformer-based Re-ID method to obtain rich features and extract an in-depth video representation. ...
arXiv:2203.01536v3
fatcat:eynby2wj6fgcpo7a5sgur74efu
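Since this survey starts from the fundamentals of self-attention, a bare-bones version is easy to state. The sketch below is the standard scaled dot-product formulation (single head, no masking or learned module wrappers), with arbitrary example dimensions.

    import torch

    def self_attention(x, wq, wk, wv):
        # x: (batch, tokens, dim); wq/wk/wv: (dim, dim) projections.
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        return scores.softmax(dim=-1) @ v   # attention-weighted values

    d = 64
    x = torch.randn(1, 16, d)               # 16 patch tokens
    out = self_attention(x, *(torch.randn(d, d) for _ in range(3)))
    print(out.shape)                         # torch.Size([1, 16, 64])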
Transformers Meet Visual Learning Understanding: A Comprehensive Review
[article]
2022
arXiv
pre-print
The latter covers object tracking and video classification. This makes it possible to compare different models' performance on various tasks across several public benchmark data sets. ...
The dynamic attention mechanism and global modeling ability give the Transformer strong feature-learning capability. In recent years, the Transformer has become comparable to CNN methods in computer vision. ...
Transformer-based object tracking and video classification are reviewed for video tasks.
arXiv:2203.12944v1
fatcat:h2kgxfnqqvcbfelvpnteqpytcu
Attention mechanisms and deep learning for machine vision: A survey of the state of the art
[article]
2021
arXiv
pre-print
Subsequently, the major categories at the intersection of attention mechanisms and deep learning for machine vision (MV) are discussed. ...
However, purely attention-based models/architectures like transformers require huge amounts of data, long training times, and large computational resources. ...
Traditionally, CNN-based techniques for video classification usually performed 3D spatio-temporal manipulation on relatively small intervals for video understanding. ...
arXiv:2106.07550v1
fatcat:fzx6d6bwhfawhcbuiebdymmaei
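For contrast with the attention models this survey covers, the CNN baseline it mentions boils down to 3D spatio-temporal convolution over short clip intervals. A minimal block, with illustrative (assumed) kernel and stride choices:

    import torch
    import torch.nn as nn

    # Spatio-temporal filtering over a short 16-frame clip interval.
    block = nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2),
                  padding=(1, 3, 3)),
        nn.BatchNorm3d(64), nn.ReLU(inplace=True))

    clip = torch.randn(1, 3, 16, 112, 112)   # (batch, rgb, frames, H, W)
    print(block(clip).shape)                 # torch.Size([1, 64, 16, 56, 56])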
Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling
[article]
2021
arXiv
pre-print
Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances by using only video-level labels based on Multi-Instance Learning (MIL). ...
Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to recognize and localize temporal starts and ends of action instances in an untrimmed video with only video-level label supervision. ...
ViViT is a pure Transformer-based model for action recognition. ...
arXiv:2106.11811v1
fatcat:gdtj4apmwzgivn2bk3gkf25lea
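The video-level MIL training mentioned in this snippet can be sketched generically: per-snippet class scores are pooled over time (here with a top-k mean, a common but assumed choice) into video-level logits trained against video labels. This is the general WS-TAL recipe, not LGBM-Net's exact formulation, which adds local-global background modeling.

    import torch
    import torch.nn.functional as F

    def mil_video_loss(snippet_logits, video_labels, k=8):
        # snippet_logits: (batch, time, n_classes) per-snippet class scores.
        # video_labels:   (batch, n_classes) multi-hot video-level labels.
        topk = snippet_logits.topk(k, dim=1).values  # most confident snippets
        video_logits = topk.mean(dim=1)              # (batch, n_classes)
        return F.binary_cross_entropy_with_logits(video_logits, video_labels)

    loss = mil_video_loss(torch.randn(4, 100, 20), torch.zeros(4, 20))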
Learning Tracking Representations via Dual-Branch Fully Transformer Networks
[article]
2021
arXiv
pre-print
Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with others within an attention window. ...
We present a Siamese-like dual-branch network based solely on Transformers for tracking. ...
Acknowledgment: We would like to thank Chenyan Wu for helpful advice. This work was supported by NSFC (No. 61773117 and No. 62006041). ...
arXiv:2112.02571v1
fatcat:ifqryr6e45dkvlijmcjc7d6e5i
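The non-overlapping patch tokenization described in this snippet is typically implemented as a single strided convolution. A minimal sketch, with patch size 16 and embedding dimension 256 as assumptions:

    import torch
    import torch.nn as nn

    # A stride-16 conv splits the image into 16x16 patches and projects
    # each to a 256-dim feature vector in one pass.
    embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)
    search = torch.randn(1, 3, 256, 256)
    tokens = embed(search).flatten(2).transpose(1, 2)  # (1, 256 patches, 256)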
Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation
[article]
2022
arXiv
pre-print
Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. ...
In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating the U-Net architecture. ...
Although approaches based on various architectures have been proposed that greatly improve the accuracy of video classification, their performance is limited on the action segmentation task for untrimmed ...
arXiv:2205.13425v1
fatcat:jlms4ohgmvbeppwjyggw65ps2i
ActionFormer: Localizing Moments of Actions with Transformers
[article]
2022
arXiv
pre-print
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. ...
Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. ...
Specifically, our model, dubbed ActionFormer, integrates local self-attention to extract a feature pyramid from an input video. ...
arXiv:2202.07925v1
fatcat:w2l5rfa74fglzebax3qukf5x3i
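The local-self-attention feature pyramid named in this snippet can be sketched as windowed temporal attention alternating with strided downsampling. Window size, depth, head count, and the use of nn.MultiheadAttention below are illustrative assumptions, not ActionFormer's exact configuration; in this toy version the input length must be divisible by the window at every level.

    import torch
    import torch.nn as nn

    class LocalAttnPyramid(nn.Module):
        # Windowed (local) temporal self-attention, then a stride-2 conv
        # halves the time axis, yielding features at strides 1, 2, 4, ...
        def __init__(self, dim=256, levels=3, window=16):
            super().__init__()
            self.window = window
            self.attn = nn.ModuleList(
                nn.MultiheadAttention(dim, 4, batch_first=True)
                for _ in range(levels))
            self.down = nn.ModuleList(
                nn.Conv1d(dim, dim, 3, stride=2, padding=1)
                for _ in range(levels))

        def forward(self, x):                 # x: (batch, time, dim)
            pyramid = []
            for attn, down in zip(self.attn, self.down):
                b, t, d = x.shape
                w = x.reshape(b * t // self.window, self.window, d)
                w, _ = attn(w, w, w)          # attention within each window
                x = w.reshape(b, t, d)
                pyramid.append(x)
                x = down(x.transpose(1, 2)).transpose(1, 2)  # halve time
            return pyramid

    feats = LocalAttnPyramid()(torch.randn(2, 64, 256))  # lengths 64, 32, 16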
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention
[article]
2020
arXiv
pre-print
Crucially, it is extremely efficient by factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (1D channel vector for 'what' and 2D spatial tensors for 'where'), followed ...
Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. ...
In this paper, we focus on the ResNet-50 based TSM [29] as the main instantiation for integration with W3. ...
arXiv:2004.01278v1
fatcat:wm4rss3czjddzhio4hc3ad4oym
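The "what"/"where" factorization quoted above (a 1D channel vector plus a 2D spatial map instead of attention over the full feature volume) can be sketched per frame as follows. Layer sizes are assumptions, and the paper's temporal "when" branch is omitted.

    import torch
    import torch.nn as nn

    class WhatWhereAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.what = nn.Sequential(        # 1D channel vector ('what')
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, channels), nn.Sigmoid())
            self.where = nn.Sequential(       # 2D spatial map ('where')
                nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

        def forward(self, x):                 # x: (batch, channels, H, W)
            x = x * self.what(x)[:, :, None, None]  # reweight channels
            return x * self.where(x)                # reweight locations

    out = WhatWhereAttention(64)(torch.randn(2, 64, 14, 14))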
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
[article]
2021
arXiv
pre-print
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. ...
In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer ...
Furthermore, pure transformer models have also achieved competitive performance for vision tasks, e.g., image classification [14, 43], object detection [43, 83], and video classification [2, 6]. ...
arXiv:2108.09322v2
fatcat:rasr4vrvijdg5crzlhy6ntmcnm
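The cross-modal attention building block behind the three mechanisms this abstract mentions is one modality's tokens attending to another's. Below is a single generic direction (RGB attending to motion tokens; names and sizes assumed for illustration), not any of MM-ViT's specific variants.

    import torch
    import torch.nn as nn

    def cross_modal_attention(tokens_a, tokens_b, attn):
        # Modality A queries modality B's keys/values.
        out, _ = attn(query=tokens_a, key=tokens_b, value=tokens_b)
        return out

    attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
    rgb = torch.randn(2, 49, 256)       # e.g. RGB patch tokens
    motion = torch.randn(2, 49, 256)    # e.g. motion-vector tokens
    fused = cross_modal_attention(rgb, motion, attn)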
Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition
[article]
2018
arXiv
pre-print
... Xception network (iTXN) for video understanding. ...
... (e.g., CNN+RNN) have been proposed in existing state-of-the-art work for this task, video modelling is far from being well solved. ...
NetVLAD [13], Action-VLAD [14], and Attention Clusters [15] have recently been proposed to integrate local features for action recognition, and these encoding methods achieve good results. ...
arXiv:1806.10319v1
fatcat:emgjryszvfbw7nfduxdbr36aaq
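Of the encoding methods this snippet cites, NetVLAD is the canonical one and is small enough to sketch: local features are soft-assigned to learnable cluster centers and their residuals aggregated. This is a simplified version (plain linear soft-assignment, no decoupled cluster parameters), not the cited papers' exact code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NetVLAD(nn.Module):
        def __init__(self, feat_dim, n_clusters):
            super().__init__()
            self.assign = nn.Linear(feat_dim, n_clusters)
            self.centers = nn.Parameter(torch.randn(n_clusters, feat_dim))

        def forward(self, x):                    # x: (batch, n_local, dim)
            a = self.assign(x).softmax(dim=-1)   # soft cluster assignment
            # Residual of every local feature against every center.
            res = x.unsqueeze(2) - self.centers  # (batch, n_local, K, dim)
            vlad = (a.unsqueeze(-1) * res).sum(dim=1)       # (batch, K, dim)
            vlad = F.normalize(vlad, dim=-1)     # intra-normalization
            return F.normalize(vlad.flatten(1), dim=-1)

    enc = NetVLAD(512, 16)(torch.randn(2, 32, 512))  # -> (2, 8192)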
Attention Architectures for Machine Vision and Mobile Robots
[chapter]
2005
Neurobiology of Attention
In robotic systems, we understand attention embedded in the context of optimizing sensorimotor behavior and multisensor-based active perception. ...
We address successful methodologies on saliency and feature selection, describe attentive systems with respect to object and scene recognition, and review saccadic interpretation under decision processes ...
The underlying model is based on the Feature Integration Theory from Treisman and Gelade (1980). ...
doi:10.1016/b978-012375731-9/50109-9
fatcat:u33zz74qjnbkfeagryxffx7lve
Showing results 1 — 15 out of 31,087 results