Structured Pruning of Large Language Models
[article]
2019
arXiv
pre-print
Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. ...
Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? ...
This work contributes to reducing the growing overhead of large language models, and shines a light on the role of model capacity in language modeling. ...
arXiv:1910.04732v1
fatcat:o2daer4ftraalg4jfnvssv6tgq
Structured Pruning of Large Language Models
2020
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
unpublished
Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. ...
We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks ...
We would also like to thank Hugh Perkins, Sam Bowman, Nicholas Matthews, Josh Shapiro and the other members of the Language Technology and Research teams who helped review this work and contributed their ...
doi:10.18653/v1/2020.emnlp-main.496
fatcat:n4rj2e6carcy3kiuzm3rmv355m
Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning
[article]
2020
arXiv
pre-print
In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning. ...
Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. ...
In this work, we propose an efficient Transformer-based large-scale language representation using block structured pruning. ...
arXiv:2009.08065v4
fatcat:gef7hlznirgszirmoqauozmt2u
Reducing Transformer Depth on Demand with Structured Dropout
[article]
2019
arXiv
pre-print
In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. ...
These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. ...
A.1.2 LANGUAGE MODELING Training: To handle the large vocabulary of Wikitext-103, we follow Dauphin et al. (2017) and Baevski & Auli (2018) in using adaptive softmax and adaptive input for computational ...
arXiv:1909.11556v1
fatcat:yhf6lreaz5alhdq3rhl2ga77su
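The LayerDrop snippet above names the mechanism but does not show it, so here is a minimal, illustrative sketch of structured layer dropout: whole layers are skipped at random during training, which is what later allows layers to be pruned at inference without retraining. This is not the authors' code; the class name, the p_drop parameter, and the layer list are assumptions made for the example.

```python
import random
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Sketch of LayerDrop-style structured dropout over a stack of layers."""

    def __init__(self, layers, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # Structured dropout: skip an entire layer, not individual units.
            if self.training and random.random() < self.p_drop:
                continue
            x = layer(x)
        return x
```

At inference, one would keep only a subset of the layers (for example every other one, or a subset chosen on validation data); because training already exposed the model to missing layers, this pruning needs no retraining, which is the "efficient pruning at inference time" the snippet refers to.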
Accelerating Natural Language Understanding in Task-Oriented Dialog
[article]
2020
arXiv
pre-print
In this work, we show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters. ...
Task-oriented dialog models typically leverage complex neural architectures and large-scale, pre-trained Transformers to achieve state-of-the-art performance on popular natural language understanding benchmarks ...
Distillation achieves results similar to structured pruning at 0-50% sparsity, but its performance drops off sharply after 80%. ...
arXiv:2006.03701v1
fatcat:is2dx34gtndhjdrylgy5n233gm
Structured Pruning of a BERT-based Question Answering Model
[article]
2021
arXiv
pre-print
The recent trend in industry-setting Natural Language Processing (NLP) research has been to operate large-scale pretrained language models like BERT under strict computational limits. ...
In this paper, we investigate compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying transformer model. ...
Introduction While knowledge distillation from large pretrained language models (e.g. ...
arXiv:1910.06360v3
fatcat:bkjuy3q7xnfgha4yviwgbnor54
Page 225 of Computational Linguistics Vol. 33, Issue 2
[page]
2007
Computational Linguistics
Feature Weight
language model (large) 1.00
language model (bitext) 1.03
P(y | x) 0.155
P(x | y) 1.23
P(y | x) 1.61
P.. ...
, as well as rules that contain multiple lexical items instead of one, an m-gram model whose structure cuts across the structure of context-free derivations, and large amounts of training data for meaningful ...
Language Adaptive Cross-lingual Speech Representation Learning with Sparse Sharing Sub-networks
[article]
2022
arXiv
pre-print
However, the standard XLSR model suffers from a language interference problem due to its lack of language-specific modeling ability. In this work, we investigate language-adaptive training on XLSR models. ...
It makes room for language-specific modeling by pruning out unimportant parameters for each language, without requiring any manually designed language-specific component. ...
The structure of the adapter module follows [22], and the projection dimension is set to 256 for both the base and large models. ...
arXiv:2203.04583v1
fatcat:yl6h2naqazhxzntaoznoeznjcm
Reweighted Proximal Pruning for Large-Scale Language Representation
[article]
2019
arXiv
pre-print
In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for a large-scale language representation model. ...
Is it possible to compress these large-scale language representation models? How will the pruned language representation affect the downstream multi-task transfer learning objectives? ...
This is necessary in the weight pruning of super-deep language representation models. ...
arXiv:1909.12486v2
fatcat:2vu6giuusrc35pq25pia2vbk4e
Block Pruning For Faster Transformers
[article]
2021
arXiv
pre-print
Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. ...
We find that this approach learns to prune out full components of the underlying model, such as attention heads. ...
There has been a growing interest in the compression of pre-trained language models. We consider three varieties of methods: distillation, pruning, and structured pruning. ...
arXiv:2109.04838v1
fatcat:44uzhne4lndfzeesdhqevdg2cm
On Compressing N-Gram Language Models
2007
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
INTRODUCTION The major part of memory consumption of large-vocabulary continuous speech recognition systems is usually due to the size of statistical language models. ...
BASELINE STRUCTURE
Back-off language model In the rest of the paper, we assume that the language model is represented in a common back-off format. ...
doi:10.1109/icassp.2007.367228
dblp:conf/icassp/Hirsimaki07
fatcat:tvlojckk3zda3fekg3vjh7r73e
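For context on the "common back-off format" mentioned in the snippet above, the standard back-off n-gram formulation (general background, not a result of this paper) is roughly:

```latex
P(w_i \mid w_{i-n+1}^{\,i-1}) =
\begin{cases}
  P^{*}(w_i \mid w_{i-n+1}^{\,i-1}) & \text{if } c(w_{i-n+1}^{\,i}) > 0, \\
  \alpha(w_{i-n+1}^{\,i-1}) \, P(w_i \mid w_{i-n+2}^{\,i-1}) & \text{otherwise,}
\end{cases}
```

where P* is a discounted n-gram probability and α is a back-off weight. Storing these probability and back-off tables is what makes such models large, which is the memory cost that the paper's compression targets.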
An Approach to Pruning Metamodels like UML
2017
Proceedings of the 5th International Conference on Model-Driven Engineering and Software Development
There are a large number of modeling languages based on metamodels, and many of the languages are large and complex. In many cases, only part of a metamodel is needed. ...
By analyzing characteristics such as the special relations between packages and the step-by-step mechanism for strictly defining modeling concepts, this paper presents an approach to pruning metamodels ...
ACKNOWLEDGEMENTS The work was supported by the National Natural Science Foundation of China (No. 61672046). ...
doi:10.5220/0006144004090417
dblp:conf/modelsward/Ma17
fatcat:yuq42ayw4nd3hex5xi4lc3oaky
Structured Sparsification of Gated Recurrent Neural Networks
2020
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
We test our approach on the text classification and language modeling tasks. Our method improves the neuron-wise compression of the model in most of the tasks. ...
We also observe that the resulting structure of gate sparsity depends on the task and connect the learned structures to the specifics of the particular tasks. ...
For the large model (fig. 7), the structure is slightly different than for the small model. ...
doi:10.1609/aaai.v34i04.5938
fatcat:qfixhxyojbextd77pvxsvhk6yy
NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM
[article]
2021
arXiv
pre-print
Recently, hardware manufacturers have introduced dedicated hardware for NxM sparsity to provide the flexibility of unstructured pruning with the runtime efficiency of structured approaches. ...
To address such an issue in a principled manner, we introduce a new learning framework, called NxMTransformer, to induce NxM semi-structured sparsity on pretrained language models for natural language ...
Acknowledgments and Disclosure of Funding We thank the anonymous NeurIPS reviewers for their constructive comments. ...
arXiv:2110.15766v1
fatcat:4q72ovgalbcbnmmuuakrn4paze
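The NxMTransformer snippet above mentions NxM (often written N:M) semi-structured sparsity without showing the pattern. The sketch below is a hedged illustration of that pattern only, not the paper's ADMM-based training: in every group of M consecutive weights, the N largest-magnitude entries are kept. The function name and shapes are assumptions made for the example.

```python
import torch

def nxm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a 0/1 mask enforcing N:M sparsity along the last dimension."""
    rows, cols = weight.shape
    assert cols % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(rows, cols // m, m)
    keep = groups.topk(n, dim=-1).indices        # n largest-magnitude per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(rows, cols)

# Example: apply a 2:4 pattern to a random weight matrix.
w = torch.randn(8, 16)
w_sparse = w * nxm_mask(w, n=2, m=4)
```

The 2:4 case corresponds to the pattern supported by the dedicated sparse hardware the snippet mentions; how the surviving weights are chosen and fine-tuned (ADMM in the paper) is a separate question.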
Compression of Deep Learning Models for Text: A Survey
[article]
2021
arXiv
pre-print
of such models to enable their deployment in real industry NLP projects. Given the critical need for building applications with efficient and small models, and the large amount of recently published work ...
In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanks to deep learning models like Recurrent Neural Networks (RNNs), Gated ...
While weight pruning can in theory remove a large fraction of parameters, the practical implementation of sparse data structures is difficult. Pruning and regularization need to be done together carefully. ...
arXiv:2008.05221v4
fatcat:6frf2wzi7zganaqgkuvy4szgmq
Showing results 1 — 15 out of 46,519 results