7 results

Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov
2019 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)  
At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel.  ...  As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels.  ...  Acknowledgments We thank Zhilin Yang for helpful discussion on the positional encoding in Transformer's Attention.  ... 
doi:10.18653/v1/d19-1443 dblp:conf/emnlp/TsaiBYMS19 fatcat:qq7v6p2ssfablf6qtu2bsx557y
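
For orientation, the kernel formulation mentioned in the abstract above amounts to viewing attention as a kernel smoother over the key set, with ordinary scaled dot-product attention recovered by an (asymmetric) exponential kernel. The following minimal NumPy sketch, using illustrative variable names of my own rather than the paper's notation, checks that equivalence numerically.

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard scaled dot-product attention.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                   # softmax over keys
        return w @ V

    def kernel_smoother_attention(Q, K, V, kernel):
        # Attention as a kernel smoother:
        # out(q) = sum_k [ kernel(q, x_k) / sum_k' kernel(q, x_k') ] * v_k
        sims = np.array([[kernel(q, k) for k in K] for q in Q])
        w = sims / sims.sum(axis=-1, keepdims=True)
        return w @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 4)); K = rng.normal(size=(5, 4)); V = rng.normal(size=(5, 2))

    # The exponential kernel exp(<q, k> / sqrt(d_k)) reproduces softmax attention exactly;
    # the variant proposed in the paper swaps in a product of symmetric kernels instead.
    exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(Q.shape[-1]))
    assert np.allclose(softmax_attention(Q, K, V),
                       kernel_smoother_attention(Q, K, V, exp_kernel))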

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel [article]

Yao-Hung Hubert Tsai and Shaojie Bai and Makoto Yamada and Louis-Philippe Morency and Ruslan Salakhutdinov
2019 arXiv   pre-print
At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel.  ...  As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels.  ...  Acknowledgments We thank Zhilin Yang for helpful discussion on the positional encoding in Transformer's Attention.  ... 
arXiv:1908.11775v4 fatcat:innb3tmlw5ambbncdgytjsw4wy

On the Ability and Limitations of Transformers to Recognize Formal Languages [article]

Satwik Bhattamishra, Kabir Ahuja, Navin Goyal
2020 arXiv   pre-print
Our analysis also provides insights on the role of self-attention mechanism in modeling certain behaviors and the influence of positional encoding schemes on the learning and generalization abilities of  ...  We first provide a construction of Transformers for a subclass of counter languages, including well-studied languages such as n-ary Boolean Expressions, Dyck-1, and its generalizations.  ...  Transformer dissection: An unified under- standing for transformer's attention via the lens of kernel.  ... 
arXiv:2009.11264v2 fatcat:bdyjkxoyyvfyzo5afdasl7ehnm
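
As a quick reminder of what the counter-language results above refer to, Dyck-1 (balanced brackets) is decidable with a single counter; the toy Python check below (not code from the paper) spells out the predicate such a Transformer construction has to emulate.

    def is_dyck1(s: str) -> bool:
        # Dyck-1 membership via one counter: the count must never dip below
        # zero while reading and must return to zero at the end.
        count = 0
        for ch in s:
            if ch == '(':
                count += 1
            elif ch == ')':
                count -= 1
            else:
                return False          # symbol outside the Dyck-1 alphabet
            if count < 0:             # a ')' with no matching '('
                return False
        return count == 0

    assert is_dyck1("(()())") and not is_dyck1("())(")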

On the Computational Power of Transformers and its Implications in Sequence Modeling [article]

Satwik Bhattamishra, Arkil Patel, Navin Goyal
2020 arXiv   pre-print
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.  ...  In particular, the roles of various components in Transformers such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.  ...  Acknowledgements We thank the anonymous reviewers for their constructive comments and suggestions.  ... 
arXiv:2006.09286v3 fatcat:ssnkohoqczbghlejr6ivftefbq

On the Ability and Limitations of Transformers to Recognize Formal Languages

Satwik Bhattamishra, Kabir Ahuja, Navin Goyal
2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Our analysis also provides insights on the role of self-attention mechanism in modeling certain behaviors and the influence of positional encoding schemes on the learning and generalization abilities of  ...  We first provide a construction of Transformers for a subclass of counter languages, including well-studied languages such as n-ary Boolean Expressions, Dyck-1, and its generalizations.  ...  Acknowledgements We thank the anonymous reviewers for their constructive comments and suggestions.  ... 
doi:10.18653/v1/2020.emnlp-main.576 fatcat:o6puxlb555dfto2laeboktovy4

Graph Neural Networks for Natural Language Processing: A Survey [article]

Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, Hanning Gao, Shucheng Li, Jian Pei, Bo Long
2021 arXiv   pre-print
To the best of our knowledge, this is the first comprehensive overview of Graph Neural Networks for Natural Language Processing.  ...  Finally, we discuss various outstanding challenges for making the full use of GNNs for NLP as well as future research directions.  ...  Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel.  ... 
arXiv:2106.06090v1 fatcat:zvkhinpcvzbmje4kjpwjs355qu

On the Computational Power of Transformers and Its Implications in Sequence Modeling

Satwik Bhattamishra, Arkil Patel, Navin Goyal
2020 Proceedings of the 24th Conference on Computational Natural Language Learning
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.  ...  In particular, the roles of various components in Transformers such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.  ...  Acknowledgements We thank the anonymous reviewers for their constructive comments and suggestions.  ... 
doi:10.18653/v1/2020.conll-1.37 fatcat:boffvpweenhq5ggjgf5d27qoy4