238 Hits in 5.9 sec

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [article]

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
2019 arXiv   pre-print
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling.  ...  We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence.  ...  Conclusions: We propose a novel architecture, Transformer-XL, for language modeling with self-attention architectures beyond a fixed-length context.  ... 
arXiv:1901.02860v3 fatcat:5yp5qndscrb3bk3f5jts7byqri
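
The core mechanism of this entry, segment-level recurrence, can be sketched in a few lines of PyTorch. The sketch below is a simplified illustration under assumed shapes (single attention head, no relative positional encoding, no causal mask within the segment); it is not the authors' released code.

```python
# Simplified sketch of segment-level recurrence (assumed shapes; not the authors' code).
# Hidden states of the previous segment are cached and reused, without gradients,
# as extra context for the current segment, extending the effective context length.
import torch
import torch.nn.functional as F

def segment_attention(q_proj, k_proj, v_proj, segment, memory):
    """segment: [seg_len, d_model] current input; memory: [mem_len, d_model] cached states."""
    context = torch.cat([memory.detach(), segment], dim=0)  # stop gradients into the cache
    q = q_proj(segment)                                     # queries only over the new segment
    k, v = k_proj(context), v_proj(context)                 # keys/values over cache + segment
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v, segment.detach()                       # output, plus the cache for the next segment

d_model = 16
q_proj, k_proj, v_proj = (torch.nn.Linear(d_model, d_model) for _ in range(3))
memory = torch.zeros(0, d_model)                            # empty cache before the first segment
for segment in torch.randn(3, 8, d_model):                  # three consecutive segments of length 8
    out, memory = segment_attention(q_proj, k_proj, v_proj, segment, memory)
```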

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling.  ...  We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence.  ...  As a consequence of the fixed context length, the model cannot capture any longer-term dependency beyond the predefined context length.  ... 
doi:10.18653/v1/p19-1285 dblp:conf/acl/DaiYYCLS19 fatcat:uj5slqod75aufbmxohs2i6gria

Transformer-XL Based Music Generation with Multiple Sequences of Time-valued Notes [article]

Xianchao Wu and Chengyuan Wang and Qinying Lei
2020 arXiv   pre-print
sequence-based Transformer-XL, evaluated automatically and manually.  ...  the other is the separate use of four sequences, namely former note-on to current note-on, note-on to note-off, pitch, and velocity, for jointly training four Transformer-XL networks.  ...  of among language words or music notes in this paper) beyond a fixed length without disrupting temporal coherence.  ... 
arXiv:2007.07244v1 fatcat:if6w762udvgilkk64atveu35ai

Relational Memory-Augmented Language Models

Qi Liu, Dani Yogatama, Phil Blunsom
2022 Transactions of the Association for Computational Linguistics  
We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph.  ...  Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.  ...  We also thank Angeliki Lazaridou, Cyprien de Masson d'Autume, Lingpeng Kong, Laura Rimell, Aida Nematzadeh, and the DeepMind language team for their helpful discussions.  ... 
doi:10.1162/tacl_a_00476 fatcat:upvkhst7hjdaxavtmbjf4vp53a

Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory [article]

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang
2020 arXiv   pre-print
In this work, we proposed a novel augmented-memory self-attention, which attends on a short segment of the input sequence and a bank of memories.  ...  Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition.  ...  An example of this approach is Transformer-XL [20], which can model a very long dependency on text data for language modeling.  ... 
arXiv:2005.08042v1 fatcat:7u4uyw6ywvbf3iq7ygkfnydqou
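
The augmented-memory attention described in this snippet can be illustrated roughly as follows; the mean-pooled memory slot and the unbounded bank are assumptions made for the sketch, not the paper's exact design (which may compute and bound the memory differently).

```python
# Minimal sketch of attention over a short segment plus a memory bank
# (assumed details; the paper's exact memory update may differ).
import torch
import torch.nn.functional as F

def augmented_memory_attention(q_proj, k_proj, v_proj, segment, bank):
    """segment: [seg_len, d]; bank: [bank_size, d] of memory vectors."""
    context = torch.cat([bank, segment], dim=0)               # short segment + memory bank
    q, k, v = q_proj(segment), k_proj(context), v_proj(context)
    out = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1) @ v
    summary = segment.mean(dim=0, keepdim=True)               # one new memory slot per segment (assumed pooling)
    return out, torch.cat([bank, summary], dim=0)

d = 16
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))
bank = torch.zeros(0, d)                                      # empty memory bank at stream start
for seg in torch.randn(4, 8, d):                              # a stream of four short segments
    out, bank = augmented_memory_attention(q_proj, k_proj, v_proj, seg, bank)
```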

Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang
2020 Interspeech 2020  
Index Terms: streaming speech recognition, transformer, acoustic modeling. Transformer-based acoustic models: We first give a brief introduction of self-attention, which is the core of the transformer-based  ...  In this work, we proposed a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories.  ...  The Transformer-XL baseline used a segment length of 128, which is identical to that of the proposed model. The "+look-ahead" reports the extension of Transformer-XL with right context 5.  ... 
doi:10.21437/interspeech.2020-2079 dblp:conf/interspeech/WuWSYZ20 fatcat:p5awv5fk4bbmvcknsendoyo624

Relational Memory Augmented Language Models [article]

Qi Liu, Dani Yogatama, Phil Blunsom
2022 arXiv   pre-print
We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph.  ...  Our model provides a simple yet effective way to combine an autoregressive language model with a knowledge graph for a more coherent and logical generation.  ...  We also thank Angeliki Lazaridou, Cyprien de Masson d'Autume, Lingpeng Kong, Laura Rimell, Aida Nematzadeh, and the DeepMind language team for their helpful discussions.  ... 
arXiv:2201.09680v1 fatcat:vjtm4625c5bjlkvkqqe4awmmoy

General-purpose, long-context autoregressive modeling with Perceiver AR [article]

Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour (+3 others)
2022 arXiv   pre-print
Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms.  ...  We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking  ...  Transformer-XL (ours) is a reimplementation in our codebase. We train Perceiver AR models with varying context lengths.  ... 
arXiv:2202.07765v1 fatcat:rqqiu6jczzc55a3gmw3nysvztq
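
A rough illustration of the mechanism summarized above: a small set of latents, assumed here to sit at the final positions of a long input, cross-attends to the whole prefix under a causal mask. Shapes and the latent placement are assumptions for the sketch, not the released implementation.

```python
# Rough sketch of causal cross-attention from a few latents to a long input
# (assumed shapes and latent placement; not the released Perceiver AR code).
import torch
import torch.nn.functional as F

def causal_cross_attention(q_proj, k_proj, v_proj, inputs, num_latents):
    """inputs: [seq_len, d]; the last `num_latents` positions act as latent queries."""
    seq_len, d = inputs.shape
    q = q_proj(inputs[-num_latents:])                        # [num_latents, d]
    k, v = k_proj(inputs), v_proj(inputs)                    # [seq_len, d] over the full long context
    scores = q @ k.T / d ** 0.5
    # Latent i sits at absolute position seq_len - num_latents + i and may only
    # attend to keys at or before that position (end-to-end causal masking).
    latent_pos = torch.arange(seq_len - num_latents, seq_len).unsqueeze(1)
    key_pos = torch.arange(seq_len).unsqueeze(0)
    scores = scores.masked_fill(key_pos > latent_pos, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                     # [num_latents, d] latents for further layers

d = 32
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))
latents = causal_cross_attention(q_proj, k_proj, v_proj, torch.randn(4096, d), num_latents=64)
```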

Do Transformers Need Deep Long-Range Memory [article]

Jack W. Rae, Ali Razavi
2020 arXiv   pre-print
For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied  ...  Deep attention models have advanced the modelling of sequential data across many domains.  ...  For regular language modelling, Daniluk et al. (2017) observed that an LSTM augmented with attention would rarely attend beyond seven preceding words of context.  ... 
arXiv:2007.03356v1 fatcat:rlgnwc4oyncijdsw533n72xaqu

Memformer: A Memory-Augmented Transformer for Sequence Modeling [article]

Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu
2022 arXiv   pre-print
We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement.  ...  Analysis of the attention pattern shows that our external memory slots can encode and retain important information through timesteps.  ...  Transformer-XL (Dai et al., 2019) used relative positional encoding and consisted of a segment-level recurrence mechanism to encode beyond a fixed-length context.  ... 
arXiv:2010.06891v2 fatcat:hueckemqr5hmncnlmabhk6djz4
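
MRBP itself is only named in this snippet; as a rough, hypothetical analogy in plain PyTorch (not the paper's algorithm), gradient checkpointing across timesteps shows the same memory/compute trade: each step's activations are recomputed during the backward pass instead of being stored.

```python
# Rough analogy only (not the paper's MRBP): gradient checkpointing across
# timesteps recomputes each step's activations on the backward pass, trading
# compute for the activation memory that long-range BPTT would otherwise need.
import torch
from torch.utils.checkpoint import checkpoint

d = 16
step = torch.nn.GRUCell(d, d)                 # stand-in for one memory-update step
inputs = torch.randn(64, 4, d)                # 64 timesteps, batch of 4
memory = torch.zeros(4, d)

for x in inputs:
    # activations inside `step` are not stored; they are recomputed when backward() runs
    memory = checkpoint(step, x, memory, use_reentrant=False)

memory.sum().backward()                       # gradients flow through all 64 timesteps
```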

Hierarchical Transformers Are More Efficient Language Models [article]

Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski
2022 arXiv   pre-print
We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model.  ...  These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility.  ...  Finally, we use Hourglass with relative attention parametrization from Transformer-XL (Dai et al., 2019) , evaluate it on three language modeling tasks, and compare the results with other models.  ... 
arXiv:2110.13711v2 fatcat:wyf2cm6zujbuhhzskeqinu3adq
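
The downsample, process, upsample pattern behind an hourglass-shaped Transformer can be sketched as below; the mean-pooling, the repeat-interleave upsampling, and the omission of the causal shifting an autoregressive model needs are simplifications, not the paper's exact layers.

```python
# Simplified sketch of a downsample -> process -> upsample block (assumed
# pooling/upsampling choices; causal shifting for autoregressive use is omitted).
import torch

def hourglass_block(inner, x, k):
    """x: [seq_len, d] with seq_len divisible by k; `inner` runs at the shortened length."""
    seq_len, d = x.shape
    short = x.reshape(seq_len // k, k, d).mean(dim=1)   # downsample: seq_len -> seq_len // k
    short = inner(short)                                # cheaper processing at the reduced length
    up = short.repeat_interleave(k, dim=0)              # upsample back to the original length
    return x + up                                       # residual merge with the full-resolution stream

d = 16
layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=4)   # stand-in for the inner stack
x = torch.randn(64, d)
y = hourglass_block(lambda s: layer(s.unsqueeze(1)).squeeze(1), x, k=4)
```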

XLNet: Generalized Autoregressive Pretraining for Language Understanding [article]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
2020 arXiv   pre-print
Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.  ...  With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling  ...  A.2 Two-Stream Attention Here, we provide the implementation details of the two-stream attention with a Transformer-XL backbone.  ... 
arXiv:1906.08237v2 fatcat:cjhjaocte5ew3dyuuwny5jqjxi

Highway Transformer: Self-Gating Enhanced Self-Attentive Networks [article]

Yekun Chai, Shuo Jin, Xinwen Hou
2020 arXiv   pre-print
Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, standing on the multi-headed dot product attention by attending to all the global contexts  ...  Through a pseudo information highway, we introduce a gated component, self-dependency units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional  ...  Segment-level recurrence: In Transformer-XL, the previous hidden states are cached and reused to inject the history information and attend to contexts beyond a fixed length through multi-layer stacks.  ... 
arXiv:2004.08178v5 fatcat:oiun7rh3pzcypc33nscuvztidu
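
A minimal reading of the gating idea in this snippet: a sigmoid gate computed from the layer input modulates a transformed copy of that input and is added to the self-attention output. The exact parametrization below is an assumption for illustration, not necessarily the paper's.

```python
# Minimal sketch of an LSTM-styled self-gating unit added to the attention output
# (assumed parametrization; the paper's exact formulation may differ).
import torch

class SelfDependencyUnit(torch.nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, d_model)
        self.transform = torch.nn.Linear(d_model, d_model)

    def forward(self, x, attn_out):
        g = torch.sigmoid(self.gate(x))                  # per-dimension gate from the input itself
        return attn_out + g * torch.tanh(self.transform(x))

d = 16
sdu = SelfDependencyUnit(d)
x = torch.randn(10, d)                                   # layer input
attn_out = torch.randn(10, d)                            # stands in for the multi-head attention output
y = sdu(x, attn_out)
```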

Not All Memories are Created Equal: Learning to Forget by Expiring [article]

Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, Jason Weston, Angela Fan
2021 arXiv   pre-print
Next, we show that Expire-Span can scale to memories that are tens of thousands in size, setting a new state of the art on incredibly long context tasks such as character-level language modeling and a  ...  Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory.  ...  Without the ability to forget, the Transformer-XL models require large memory for storing all navigation steps that grow with the corridor length.  ... 
arXiv:2105.06548v2 fatcat:ekbejwehvrcsrkiw467e7lfy6u
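
The expiring-memory idea can be caricatured in a few lines: each cached state predicts a span, and a soft ramp mask down-weights attention to memories older than their span, so fully expired entries can be dropped from the cache. The span parametrization and ramp width below are illustrative assumptions, not the paper's exact choices.

```python
# Illustrative sketch: per-memory expiration spans produce a soft mask that
# suppresses attention to memories whose age exceeds their predicted span.
import torch
import torch.nn.functional as F

def expire_span_mask(spans, ages, ramp=8.0):
    """spans, ages: [mem_len]; returns a value in [0, 1] per memory."""
    return torch.clamp((spans - ages) / ramp + 1.0, 0.0, 1.0)

mem_len, d = 32, 16
memory = torch.randn(mem_len, d)                              # cached hidden states
span_head = torch.nn.Linear(d, 1)
spans = F.softplus(span_head(memory)).squeeze(-1) * 100.0     # predicted spans, in timesteps
ages = torch.arange(mem_len - 1, -1, -1, dtype=torch.float)   # oldest memory has the largest age

query = torch.randn(1, d)
weights = F.softmax(query @ memory.T / d ** 0.5, dim=-1) * expire_span_mask(spans, ages)
weights = weights / weights.sum(dim=-1, keepdim=True)         # renormalize after masking
```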

Multi-Sense Language Modelling [article]

Andrea Lekkas, Peter Schneider-Kamp, Isabelle Augenstein
2021 arXiv   pre-print
Currently, none of the common language modelling architectures explicitly model polysemy. We propose a language model which not only predicts the next word, but also its sense in context.  ...  We find that multi-sense language modelling requires architectures that go beyond standard language models, and here propose a structured prediction framework that decomposes the task into a word followed  ...  two separate Transformer-XL models.  ... 
arXiv:2012.05776v2 fatcat:egh76uaeqrdb3kflrfisfn6x7q
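
One hypothetical way to realize the word-then-sense decomposition mentioned in this snippet (the head names, the flat sense inventory, and the conditioning scheme are illustration only, not the paper's architecture):

```python
# Hypothetical two-stage heads: first predict the next word, then predict its
# sense conditioned on the context state and the chosen word (assumed design).
import torch

d, vocab, num_senses = 32, 1000, 4000
word_emb = torch.nn.Embedding(vocab, d)
word_head = torch.nn.Linear(d, vocab)                 # next-word distribution
sense_head = torch.nn.Linear(2 * d, num_senses)       # sense distribution, conditioned on the word

h = torch.randn(1, d)                                 # hidden state from some LM backbone
word = word_head(h).argmax(dim=-1)                    # step 1: the next word
sense_logits = sense_head(torch.cat([h, word_emb(word)], dim=-1))  # step 2: its sense in context
sense = sense_logits.argmax(dim=-1)
```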
Showing results 1 — 15 out of 238 results