Multi-Head Attention: Collaborate Instead of Concatenate [article]

Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
2021 arXiv   pre-print
Collaborative multi-head attention reduces the size of the key and query projections by a factor of 4 for the same accuracy and speed. Our code is public.  ...  We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer.  ...  We improve the understanding of transformers by questioning a specific part of the model: the concatenation of multiple heads.  ... 
arXiv:2006.16362v2 fatcat:pulug65azzhm7dbjnun2rcb2ke

Heterogeneous Graph Attention Networks for Learning Diverse Communication [article]

Esmaeil Seraj, Zheyuan Wang, Rohan Paleja, Matthew Sklar, Anirudh Patel, Matthew Gombolay
2021 arXiv   pre-print
We propose heterogeneous graph attention networks, called HetNet, to learn efficient and diverse communication models for coordinating heterogeneous agents towards accomplishing tasks that are of collaborative  ...  However, when collaborating a team of agents with different action and observation spaces, information sharing is not straightforward and requires customized communication protocols, depending on sender  ...  Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.  ... 
arXiv:2108.09568v2 fatcat:kqnvqafqmzg3jc3jsbbcuwq5wm

An Attentive Survey of Attention Models [article]

Sneha Chaudhari, Varun Mithal, Gungor Polatkan, Rohan Ramanath
2021 arXiv   pre-print
We also describe how attention has been used to improve the interpretability of neural networks. Finally, we discuss some future research directions in attention.  ...  This survey provides a structured and comprehensive overview of the developments in modeling attention. In particular, we propose a taxonomy which groups existing techniques into coherent categories.  ...  The decoder is similar to the encoder, except that the decoder contains two multi-head attention sub-modules instead of one.  ... 
arXiv:1904.02874v3 fatcat:fyqgqn7sxzdy3efib3rrqexs74

Dual-branch Attention-In-Attention Transformer for single-channel speech enhancement [article]

Guochen Yu, Andong Li, Chengshi Zheng, Yinuo Guo, Yutian Wang, Hui Wang
2022 arXiv   pre-print
Specifically, the proposed attention-in-attention transformer consists of adaptive temporal-frequency attention transformer blocks and an adaptive hierarchical attention module, aiming to capture long-term  ...  Motivated by that, we propose a dual-branch attention-in-attention transformer dubbed DB-AIAT to handle both coarse- and fine-grained regions of the spectrum in parallel.  ...  In each branch, an improved transformer [6] is employed, which is comprised of a multi-head self-attention (MHSA) module and a GRU-based feed-forward network, followed by residual connections and LN.  ... 
arXiv:2110.06467v5 fatcat:abrljaopwnctpm3dpywtsbgevi

HAMLET: A Hierarchical Multimodal Attention-based Human Activity Recognition Algorithm [article]

Md Mofijul Islam, Tariq Iqbal
2020 arXiv   pre-print
HAMLET incorporates a hierarchical architecture, where the lower layer encodes spatio-temporal features from unimodal data by adopting a multi-head self-attention mechanism.  ...  We further visualize the unimodal and multimodal attention maps, which provide us with a tool to interpret the impact of attention mechanisms concerning HAR.  ...  However, it did not utilize attention methods to fuse the multimodal features, instead those were concatenated.  ... 
arXiv:2008.01148v1 fatcat:hjh2z5cp7faxxkalynflahnd5y

An attention-driven hierarchical multi-scale representation for visual recognition [article]

Zachary Wharton, Ardhendu Behera, Asish Bera
2021 arXiv   pre-print
These regions consist of smaller (closer look) to larger (far look), and the dependency between regions is modeled by an innovative attention-driven message propagation, guided by the graph structure to  ...  emphasize the neighborhoods of a given region.  ...  regions in our multi-scale hierarchical structure and varying number attention heads and output channels per attention head.  ... 
arXiv:2110.12178v1 fatcat:g3lgduvmxbhc7hn3s5hfgmzxa4

High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [article]

Zhiwei Zhong, Xianming Liu, Junjun Jiang, Debin Zhao, Zhiwen Chen, Xiangyang Ji
2021 arXiv   pre-print
In this paper, we propose a novel attention-based hierarchical multi-modal fusion (AHMF) network for guided DSR.  ...  Furthermore, we propose a bi-directional hierarchical feature collaboration (BHFC) module to fully leverage low-level spatial information and high-level structure information among multi-scale features  ...  Instead of pixel-wise addition, Huang et al. [28] concatenated the low-level and high-level features to fuse the multi-level features. Gu et al.  ... 
arXiv:2104.01530v2 fatcat:hj5jdijgfjcqjpqgyl2wrlmxoa

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR [article]

Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li
2020 arXiv   pre-print
Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction.  ...  It uses the encoder-decoder structure with self-attention to learn the relationship between the high-level representation of the source inputs and embedding of the target outputs.  ...  Multi-Head Attention Multi-head attention is the core module of the Transformer model.  ... 
arXiv:2006.10407v2 fatcat:6tifaidk7rfbjdczff3cf224uq

Crowd Counting Using Scale-Aware Attention Networks [article]

Mohammad Asiful Hossain, Mehrdad Hosseinzadeh, Omit Chanda, Yang Wang
2019 arXiv   pre-print
One challenge of crowd counting is the scale variation in images. In this work, we propose a novel scale-aware attention network to address this challenge.  ...  By combining these global and local scale attention, our model outperforms other state-of-the-art methods for crowd counting on several benchmark datasets.  ...  Acknowledgment This work was supported by an NSERC Engage grant in collaboration with Sightline Innovation. We thank NVIDIA for donating some of the GPUs used in this work.  ... 
arXiv:1903.02025v1 fatcat:s6ivmtoozngahny3gwiryexaea

Multi-Pointer Co-Attention Networks for Recommendation

Yi Tay, Anh Tuan Luu, Siu Cheung Hui
2018 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD '18  
Finally, we propose a multi-pointer learning scheme that learns to combine multiple views of interactions between user and item.  ...  We study the behavior of our multi-pointer learning mechanism, shedding light on evidence aggregation patterns in review-based recommender systems.  ...  This is reminiscent of (and inspired by) the Transformer [28] architecture which uses multi-headed attention, concatenating outputs of each attention call.  ... 
doi:10.1145/3219819.3220086 dblp:conf/kdd/TayLH18 fatcat:57ri3khbpnabjccjaojk5mttpy

Improving Long-Tail Relation Extraction with Collaborating Relation-Augmented Attention [article]

Yang Li, Tao Shen, Guodong Long, Jing Jiang, Tianyi Zhou, Chengqi Zhang
2020 arXiv   pre-print
Recent works alleviate the wrong labeling by selective attention via multi-instance learning, but cannot well handle long-tail relations even if hierarchies of the relations are introduced to share knowledge  ...  In this work, we propose a novel neural network, Collaborating Relation-augmented Attention (CoRA), to handle both the wrong labeling and long-tail relations.  ...  Acknowledgement This research was funded by the Australian Government through the Australian Research Council (ARC) under the grant of LP180100654.  ... 
arXiv:2010.03773v2 fatcat:4bbqy2nlazdidcpvpi5fdilhdq

D-HAN: Dynamic News Recommendation with Hierarchical Attention Network [article]

Qinghua Zhao, Xu Chen, Hui Zhang, Shuai Ma
2021 arXiv   pre-print
For capturing users' dynamic preferences, the continuous time information is seamlessly incorporated into the computing of the attention weights.  ...  More specifically, we design a hierarchical attention network, where the lower layer learns the importance of different sentences and elements, and the upper layer captures the correlations between the  ...  heads of multi-head Transformer is 2.  ... 
arXiv:2112.10085v1 fatcat:zuehraf7ujgllitiikznl2qklq

GrAMME: Semi-Supervised Learning using Multi-layered Graph Attention Models [article]

Uday Shankar Shanthamallu, Jayaraman J. Thiagarajan, Huan Song, Andreas Spanias
2019 arXiv   pre-print
Modern data analysis pipelines are becoming increasingly complex due to the presence of multi-view information sources.  ...  In this paper, we consider the problem of semi-supervised learning with multi-layered graphs. Though deep network embeddings, e.g.  ...  This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.  ... 
arXiv:1810.01405v2 fatcat:gydvjhal4jfbfi24bvnbsrpr2u

Relation-Aware Graph Attention Network for Visual Question Answering [article]

Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu
2019 arXiv   pre-print
We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive  ...  Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets.  ...  The dimension of the hidden layer in GRU is set as 1024. We employ multi-head attention with 16 heads for all three graph attention networks. The dimension of relation features is set to 1024.  ... 
arXiv:1903.12314v3 fatcat:2ed2iwme7jdwxaejelj3limvmu

Group-Node Attention for Community Evolution Prediction [article]

Matt Revelle, Carlotta Domeniconi, Ben Gelman
2021 arXiv   pre-print
The model (GNAN) includes a group-node attention component which enables support for variable-sized inputs and learned representation of groups based on member and neighbor node features.  ...  Additionally, we show the effects of network trends on model performance.  ...  Fig. 3 provides a diagram of the multi-head group-node attention portion of where FCNout uses a sigmoid activation function to  ... 
arXiv:2107.04522v1 fatcat:w4yb4o25rzhu3cmda6mqfwuny4