Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators

Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Zhi-Yuan Xie, Zhong-Yi Lu, Ji-Rong Wen
2021 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)   unpublished
This paper presents a novel pre-trained language model (PLM) compression approach based on the matrix product operator (MPO for short) from quantum many-body physics.  ...  With the decomposed MPO structure, we propose a novel fine-tuning strategy by only updating the parameters from the auxiliary tensors, and design an optimization algorithm for MPO-based approximation over  ...  Next, we study how to perform lightweight fine-tuning based on MPO properties. Parameter Variation from Pre-Training.  ... 
doi:10.18653/v1/2021.acl-long.418 fatcat:643kwego6bfttn7o6fqeolpgzi

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [article]

Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Yaming Yang, Quanlu Zhang, Yunhai Tong, Jing Bai
2020 arXiv   pre-print
BERT is a cutting-edge language representation model pre-trained by a large corpus, which achieves superior performances on various natural language understanding tasks.  ...  In this paper, we address this issue by proposing a hybrid solution named LadaBERT (Lightweight adaptation of BERT through hybrid model compression), which combines the advantages of different model compression  ...  Ideally, people can start from a pre-trained BERT checkpoint and fine-tune it on a specific downstream task.  ... 
arXiv:2004.04124v2 fatcat:sftc2oxxeff6bofyweouredjn4

Blockwise Self-Attention for Long Document Understanding [article]

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang
2020 arXiv   pre-print
We conduct experiments on language model pre-training and several benchmark question answering datasets with various paragraph lengths.  ...  We present BlockBERT, a lightweight and efficient BERT model for better modeling long-distance dependencies.  ...  These QA datasets have different paragraph length distribution patterns and are thus ideal for testing the effectiveness of BlockBERT. For example, SQuAD, NaturalQA, and HotpotQA consist of  ... 
arXiv:1911.02972v2 fatcat:o7i2dcczdva7rbmqiglydndfre

A Closer Look at Self-supervised Lightweight Vision Transformers [article]

Shaoru Wang, Jin Gao, Zeming Li, Jian Sun, Weiming Hu
2022 arXiv   pre-print
In this work, we mainly produce recipes for pre-training high-performance lightweight ViTs using masked-image-modeling-based MAE, namely MAE-lite, which achieves 78.4% top-1 accuracy on ImageNet with ViT-Tiny  ...  We analyze and clearly show the effect of such pre-training, and reveal that properly-learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks.  ...  For all of the pre-trained models, we fine-tune them for 300 epochs on IN1k for fair comparisons.  ... 
arXiv:2205.14443v1 fatcat:fd4s2jrj7jgflfxlyynn4qgkwe

Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [article]

Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, Ji-Rong Wen
2022 arXiv   pre-print
We adopt the matrix product operator (MPO, a tensor decomposition from quantum many-body physics) to reconstruct the parameter matrix in the expert layer and increase model capacity for pre-trained language  ...  Extensive experiments based on T5 and GPT-2 show improved performance and efficiency of the pre-trained language model (27.2x reduction in total parameters for the superior model performance, compared  ...  Furthermore, the MPO decomposition was used to compress the PLMs as well as enable lightweight fine-tuning in downstream tasks.  ... 
arXiv:2203.01104v3 fatcat:julsxmpjfrdmvheztzorxioxtu

The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures

Sushant Singh, Ausif Mahmood
2021 IEEE Access  
In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency.  ...  Recent research has also focused on superior inference by providing efficient attention to longer input sequences.  ...  For instance, fine-tuning a language pair i.e. (German-English) enables the model to translate from any language in the monolingual pre-training set i.e. (French English), without further training.  ... 
doi:10.1109/access.2021.3077350 fatcat:gchmms4m2ndvzdowgrvro3w6z4

The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [article]

Sushant Singh, Ausif Mahmood
2021 arXiv   pre-print
In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency.  ...  Recent research has also focused on superior inference by providing efficient attention to longer input sequences.  ...  For instance, fine-tuning a language pair i.e. (German-English) enables the model to translate from any language in the monolingual pre-training set i.e. (French English), without further training.  ... 
arXiv:2104.10640v3 fatcat:ctuyddhm3baajk5uqrynwdap44

Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity [article]

Fu-Ming Guo, Austin Huang
2021 arXiv   pre-print
Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization.  ...  This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution.  ...  Input/Output representations: We follow the input/output representation setting from Devlin et al. [2019] for both pre-training and fine-tuning.  ... 
arXiv:2106.08846v2 fatcat:bfx3lvvpzvbbffzeqic2g4ffye

Post-training deep neural network pruning via layer-wise calibration [article]

Ivan Lazarevich, Alexander Kozlov, Nikita Malinin
2021 arXiv   pre-print
We propose a data-free extension of the approach for computer vision models based on automatically-generated synthetic fractal images.  ...  We present a post-training weight pruning method for deep neural networks that achieves accuracy levels tolerable for the production setting and that is sufficiently fast to be run on commodity hardware  ...  the fine-tuned model accuracy for a ResNet18 model at 50% sparsity on ImageNet.  ... 
arXiv:2104.15023v1 fatcat:o67pulxvsncnloartcg6d7uidi

From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough

Mourad Mars
2022 Applied Sciences  
With the recent advances in deep learning, different approaches to improving pre-trained language models (PLMs) have been proposed.  ...  Then, we analyse and contrast the various models and provide an analysis of the way they have been built (number of parameters, compression techniques, etc.).  ...  We also would like to thank anonymous reviewers for their constructive feedback on the initial manuscript. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/app12178805 fatcat:sjdjsrgjxberbay2jfa7o6j63q

On the Usability of Transformers-based models for a French Question-Answering task [article]

Oralie Cattan, Christophe Servan, Sophie Rosset
2022 arXiv   pre-print
of pre-trained language models.  ...  For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, resulting in a paradigmatic shift in practices from the use of task-specific architectures to the fine-tuning  ...  Pre-training large models on massive corpora using unsupervised language modeling and fine-tuning the model with pre-trained weights requires less task-specific data.  ... 
arXiv:2207.09150v1 fatcat:u3upvtscw5dw3hejui46i5qhqy

A Survey on Green Deep Learning [article]

Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li
2021 arXiv   pre-print
The target is to yield novel results with lightweight and efficient technologies. Many technologies can be used to achieve this goal, like model compression and knowledge distillation.  ...  In recent years, larger and deeper models are springing up and continuously pushing state-of-the-art (SOTA) results across various fields like natural language processing (NLP) and computer vision (CV)  ...  fine-tuned one.  ... 
arXiv:2111.05193v2 fatcat:t2blz24y2jakteeeawqqogbkpy

Machine Learning for Microcontroller-Class Hardware – A Review [article]

Swapnil Sayan Saha, Sandeep Singh Sandha, Mani Srivastava
2022 arXiv   pre-print
This paper highlights the unique requirements of enabling onboard machine learning for microcontroller class devices.  ...  We characterize a closed-loop widely applicable workflow of machine learning model development for microcontroller class devices and show that several classes of applications adopt a specific instance  ...  In some cases, the pre-trained model is too big to apply model compression feasibly for a microcontroller (e.g., in Table XIX, AlexNet can be reduced to 6.9 MB from 240 MB) or the pre-trained model may  ... 
arXiv:2205.14550v3 fatcat:y272riitirhwfgfiotlwv5i7nu

Amortized Neural Networks for Low-Latency Speech Recognition [article]

Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow
2021 arXiv   pre-print
Here, we achieve variable compute for two well-known candidate techniques: one using sparse pruning and the other using matrix factorization.  ...  The AmNets RNN-T architecture enables the network to dynamically switch between encoder branches on a frame-by-frame basis.  ...  AmNets models are first trained with Lavg loss, which is followed by a short fine-tuning stage using Lamr loss.  ... 
arXiv:2108.01553v1 fatcat:uci5hioqhbenvmjqmkjit624wa

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [article]

William Fedus, Barret Zoph, Noam Shazeer
2022 arXiv   pre-print
Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.  ...  We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.  ...  Hugo Larochelle for sage advising and clarifying comments on the draft, Irwan Bello for detailed comments and careful revisions, Colin Raffel and Adam Roberts for timely advice on neural language models  ... 
arXiv:2101.03961v3 fatcat:jmgrr46hyzghhoxjlcrm6prhrq