
I-BERT: Integer-only BERT Quantization [article]

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
2021 arXiv   pre-print
In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic.  ...  Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating  ...  CONCLUSIONS We have proposed I-BERT, a novel integer-only quantization scheme for Transformers, where the entire inference is performed with pure integer arithmetic.  ... 
arXiv:2101.01321v3 fatcat:dhja6v44hnha5gvbzctmqcxwtq
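As a concrete illustration of the integer-only approximation idea described in the snippet above, here is a minimal NumPy sketch of a second-order polynomial evaluated with pure integer arithmetic, in the style of I-BERT's polynomial kernel. The function name `int_poly2` and the example scaling factor are illustrative and not taken from the I-BERT codebase; the coefficients 0.3585, 1.353, 0.344 are the exp-approximation constants quoted in the FQ-ViT entry further down this page.

```python
# A minimal sketch (not the official I-BERT kernel) of evaluating
# a*(x + b)^2 + c with integer arithmetic only, where x is represented as
# x ~= q * S for an integer q and a scaling factor S known ahead of time.
import numpy as np

def int_poly2(q: np.ndarray, S: float, a: float, b: float, c: float):
    """Integer-only evaluation of a*(x + b)^2 + c, with x = q * S."""
    q_b = int(np.floor(b / S))             # fold b into the integer domain
    q_c = int(np.floor(c / (a * S * S)))   # fold c into the output scale
    q_out = (q + q_b) ** 2 + q_c           # pure integer arithmetic at runtime
    S_out = a * S * S                      # output scaling factor (precomputed offline)
    return q_out, S_out

# Quick check against floating-point math, reusing the exp-approximation
# coefficients quoted in the FQ-ViT entry below.
q = np.array([-3, -1, 0, 2, 5], dtype=np.int64)
S = 0.05
q_out, S_out = int_poly2(q, S, a=0.3585, b=1.353, c=0.344)
print(q_out * S_out)                            # integer-only result, rescaled
print(0.3585 * (q * S + 1.353) ** 2 + 0.344)    # floating-point reference
```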

NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference [article]

Joonsang Yu, Junki Park, Seongmin Park, Minsoo Kim, Sihwa Lee, Dong Hyun Lee, Jungwook Choi
2021 arXiv   pre-print
The proposed framework called NN-LUT can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.  ...  Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer inferior accuracy or considerable hardware cost with long latency.  ...  I-BERT: Integer-only BERT Quantization. In ICML.  ... 
arXiv:2112.02191v1 fatcat:zb3idsytubdfpawg7mx7x7txhq

NN-LUT

Joonsang Yu, Junki Park, Seongmin Park, Minsoo Kim, Sihwa Lee, Dong Hyun Lee, Jungwook Choi
2022 Proceedings of the 59th ACM/IEEE Design Automation Conference  
Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer inferior accuracy or considerable hardware cost with long latency.  ...  The proposed framework, called Neural network generated LUT (NN-LUT), can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and  ...  INT32: adopt the scaling-factor calculation of I-BERT to quantize FP32 values into INT32 directly. In the I-BERT implementation at https://github.com/kssteven418/I-BERT, the integer input of non-linear  ... 
doi:10.1145/3489517.3530505 fatcat:rmhxntcylzcehm7euxvyv7cszu
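For context on what "replacing non-linear operations with look-up tables" means in practice, the following is a rough NumPy sketch of a piecewise-linear LUT standing in for GELU. It is not the authors' NN-LUT code: the breakpoints and table size are hand-picked here, whereas NN-LUT derives the table parameters from a trained neural approximator.

```python
# A rough sketch of approximating GELU with a small piecewise-linear
# look-up table (16 segments over [-4, 4]).
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here only as the reference function
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# each table entry stores a slope and an intercept for one segment
breakpoints = np.linspace(-4.0, 4.0, 17)
slopes = np.diff(gelu(breakpoints)) / np.diff(breakpoints)
intercepts = gelu(breakpoints[:-1]) - slopes * breakpoints[:-1]

def lut_gelu(x):
    # find the segment for each input and apply its slope/intercept
    idx = np.clip(np.searchsorted(breakpoints, x) - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

x = np.linspace(-5.0, 5.0, 101)
print(np.max(np.abs(lut_gelu(x) - gelu(x))))   # max absolute error of the 16-entry LUT
```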

Integer Fine-tuning of Transformer-based Models [article]

Mohammadreza Tayaranian, Alireza Ghaffari, Marzieh S. Tahaei, Mehdi Rezagholizadeh, Masoud Asgharian, Vahid Partovi Nia
2022 arXiv   pre-print
We fine-tune BERT and ViT models on popular downstream tasks using integer layers. We show that 16-bit integer models match the floating-point baseline performance.  ...  Furthermore, we study the effect of various integer bit-widths to find the minimum required bit-width for integer fine-tuning of transformer-based models.  ...  (Kim et al. 2021) proposed I-BERT that uses a uniform quantization scheme to quantize input activations and weights of various components of BERT.  ... 
arXiv:2209.09815v1 fatcat:3sv7rlavhbcahce7a3nkym6em4
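A minimal sketch of the uniform (symmetric) quantization scheme referenced in the snippet above, assuming per-tensor scaling; the bit-widths and tensor contents are illustrative only.

```python
# Uniform symmetric quantization with a per-tensor scaling factor.
import numpy as np

def quantize_uniform(x: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q8, s8 = quantize_uniform(w, bits=8)
q16, s16 = quantize_uniform(w, bits=16)
print(np.max(np.abs(dequantize(q8, s8) - w)))    # 8-bit rounding error
print(np.max(np.abs(dequantize(q16, s16) - w)))  # 16-bit error is far smaller
```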

A Survey on Model Compression for Natural Language Processing [article]

Canwen Xu, Julian McAuley
2022 arXiv   pre-print
I-BERT [Kim et al., 2021] eliminates floating-point calculation in BERT by exploiting lightweight integer-only approximation methods for non-linear operations (e.g., GELU, Softmax and LayerNorm).  ...  The resulting I-BERT model is capable of doing pure INT8 inference and thus has a better acceleration ratio.  ... 
arXiv:2202.07105v1 fatcat:4l5hk76o7fc37k6pttrhngpuoy

Bigger Faster: Two-stage Neural Architecture Search for Quantized Transformer Models [article]

Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko
2022 arXiv   pre-print
In this work we present Bigger&Faster, a novel quantization-aware parameter-sharing NAS that finds architectures for 8-bit integer (int8) quantized transformers.  ...  Our results show that our method is able to produce BERT models that outperform the current state-of-the-art technique, AutoTinyBERT, at all latency targets we tested, achieving up to a 2.68% accuracy  ...  In 2021, I-BERT found that integer-only quantization of BERT can achieve an inference speed-up of up to 4 times with minimal impact on test accuracies [20].  ... 
arXiv:2209.12127v1 fatcat:zf7glukglbcy3gbvb4isok7xlu

FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer [article]

Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, Shuchang Zhou
2022 arXiv   pre-print
However, most existing quantization methods have been developed mainly on Convolutional Neural Networks (CNNs), and suffer severe degradation when applied to fully quantized vision transformers.  ...  Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments.  ...  I-BERT uses a second-order polynomial to approximate the exponential function on the interval p ∈ (−ln 2, 0]: L(p) = 0.3585(p + 1.353)^2 + 0.344 ≈ exp(p), and exp(x) ≈ i-exp(x) := L(p) >> z.  ... 
arXiv:2111.13824v3 fatcat:ghcwzezqkbc5tmb267wqxeogby
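The quoted i-exp approximation can be checked numerically. The sketch below follows the formula as quoted: write x = p − z·ln 2 with p ∈ (−ln 2, 0], approximate exp(p) by the polynomial L(p), then halve the result z times (the ">> z" in the integer version). It is written in floating point for readability, whereas I-BERT performs the same steps on integers with a scaling factor.

```python
# Numeric check of the quoted i-exp approximation (float version).
import numpy as np

def i_exp_float(x: np.ndarray) -> np.ndarray:
    x = np.minimum(x, 0.0)                    # softmax feeds x - max(x) <= 0
    z = np.floor(-x / np.log(2.0))            # number of halvings
    p = x + z * np.log(2.0)                   # remainder in (-ln 2, 0]
    L = 0.3585 * (p + 1.353) ** 2 + 0.344     # second-order polynomial ~ exp(p)
    return L / 2.0 ** z

x = np.linspace(-8.0, 0.0, 81)
print(np.max(np.abs(i_exp_float(x) - np.exp(x))))   # max error on the order of 1e-3
```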

A Survey on Green Deep Learning [article]

Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li
2021 arXiv   pre-print
Massive computations not only have a surprisingly large carbon footprint but also have negative effects on research inclusiveness and deployment on real-world applications.  ...  Following this work, I-BERT was proposed, which employed lightweight integer-only approximation methods for nonlinear operations to quantize BERT. TernaryBERT was then proposed, which ternarized the weights in a fine-tuned  ...  For example, BERT requires massive computations during training. If we only consider training costs, BERT cannot be treated as a Green technology. However, BERT can improve downstream performance with fewer  ... 
arXiv:2111.05193v2 fatcat:t2blz24y2jakteeeawqqogbkpy

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
2022 ACM Transactions on Design Automation of Electronic Systems  
We start from introducing popular model compression methods, including pruning, factorization, quantization, as well as compact model design.  ...  To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization.  ...  I-BERT [147] proposes integer-only BERT. Ternary-BERT [349] and BinaryBERT [11] further quantize the model down to Ternary ({-1, 0, +1}) and binary schemes. Tambe et al.  ... 
doi:10.1145/3486618 fatcat:h6xwv2slo5eklift2fl24usine
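As a generic illustration of the ternary scheme this snippet mentions, the sketch below maps a weight tensor to {-1, 0, +1} plus a per-tensor scale using a common threshold heuristic; the 0.7·mean(|w|) threshold is an assumption borrowed from ternary weight networks, not necessarily TernaryBERT's exact rule.

```python
# Generic threshold-based ternarization to {-1, 0, +1} with a per-tensor scale.
import numpy as np

def ternarize(w: np.ndarray):
    delta = 0.7 * np.mean(np.abs(w))                 # values below the threshold become 0
    t = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    nonzero = t != 0
    alpha = float(np.abs(w[nonzero]).mean()) if nonzero.any() else 0.0
    return t.astype(np.int8), alpha                  # ternary codes + scaling factor

w = np.random.randn(256, 256).astype(np.float32)
t, alpha = ternarize(w)
print(np.unique(t))                                  # [-1  0  1]
print(np.mean((alpha * t - w) ** 2))                 # reconstruction MSE vs. FP32 weights
```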

Bioformers: Embedding Transformers for Ultra-Low Power sEMG-based Gesture Recognition [article]

Alessio Burrello, Francesco Bianco Morghet, Moritz Scherer, Simone Benatti, Luca Benini, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
2022 arXiv   pre-print
We follow the steps described in I-BERT [24] to replace the floating-point operators that compose MHSA layers with their int8 counterparts.  ...  We then perform a few epochs of quantization-aware training (QAT) to shift from fp32 to integer (int8) arithmetic.  ... 
arXiv:2203.12932v2 fatcat:dzcwjs5aqjb67iea72ekrvycg4
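A hedged sketch of the fake-quantization step that quantization-aware training (QAT) relies on, using a straight-through estimator in PyTorch: the forward pass sees int8-rounded values while the backward pass treats the rounding as the identity. This is a generic illustration, not the Bioformers training code, and the class name is made up.

```python
# Generic fake-quantization with a straight-through estimator for QAT.
import torch

class FakeQuant8(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max() / 127.0 + 1e-12             # per-tensor symmetric scale
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                # straight-through estimator

x = torch.randn(4, 8, requires_grad=True)
loss = FakeQuant8.apply(x).sum()
loss.backward()                                           # gradients flow through the rounding
print(x.grad.abs().sum())
```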

NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework [article]

Xingcheng Yao, Yanan Zheng, Xiaocong Yang, Zhilin Yang
2022 arXiv   pre-print
Other works speed up inference by quantizing PLMs with low-precision representations during inference, such as Q8-BERT (Zafrir et al., 2019), Q-BERT (Shen et al., 2020), and I-BERT (Kim et al., 2021)  ...  It only utilizes the general corpus as in BERT and RoBERTa. In comparison, domain-adaptive fine-tuning uses domain data to improve domain adaptation.  ... 
arXiv:2111.04130v2 fatcat:bsecajqdvnfrhlkhxxevmmc324

A Fast Attention Network for Joint Intent Detection and Slot Filling on Edge Devices [article]

Liang Huang, Senjie Liang, Feiyang Ye, Nan Gao
2022 arXiv   pre-print
For future work, we plan to employ an integer-only encoder, i.e., I-BERT [27], to further reduce the inference latency.  ...  Furthermore, DistillBERT-FAN and TinyBERT-FAN are the only two models whose inference latency is less than 100 ms.  ... 
arXiv:2205.07646v1 fatcat:db7yougi5fgozibvarratkccty

Learning Compact Metrics for MT [article]

Amy Pu, Hyung Won Chung, Ankur P. Parikh, Sebastian Gehrmann, Thibault Sellam
2021 arXiv   pre-print
Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.  ...  I-bert: Integer-only bert quantization. arXiv preprint arXiv:2101.01321. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization.  ...  A Training RemBERT for MT Evaluation A.1 RemBERT Pre-Training RemBERT is an encoder-only architecture, similar to BERT but with an optimized parameter allocation (Chung et al., 2021).  ... 
arXiv:2110.06341v1 fatcat:soizlnjjtjccvaynejghxwh3ky

Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation [article]

Murad Tukan and Alaa Maalouf and Matan Weksler and Dan Feldman
2020 arXiv   pre-print
Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage.  ...  For example, our approach achieves 28% compression of RoBERTa's embedding layer with only 0.63% additive drop in the accuracy (without fine-tuning) on average over all tasks in GLUE, compared to an 11% drop  ...  We compress several frequently used NLP networks: (i) BERT (Devlin et al. 2019)  ... 
arXiv:2009.05647v2 fatcat:f5si5su2mzdv5eas73cslrtfhq
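For background only, here is a plain truncated-SVD baseline for compressing an embedding layer into two low-rank factors; the paper above argues for a robust low-rank approximation instead of SVD, so this sketch only makes the parameter-count arithmetic concrete. The vocabulary size, hidden size, and rank are made up.

```python
# Truncated-SVD baseline for low-rank compression of an embedding matrix.
import numpy as np

V, d, r = 5000, 256, 64                                # vocab, hidden dim, target rank
E = np.random.randn(V, d).astype(np.float32)           # stand-in embedding matrix

U, s, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :r] * s[:r]                                   # V x r factor
B = Vt[:r]                                             # r x d factor

print((V * r + r * d) / (V * d))                       # ~0.26: fraction of original parameters
print(np.linalg.norm(E - A @ B) / np.linalg.norm(E))   # relative reconstruction error
```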

Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models [article]

Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos
2022 pre-print
Mokey proves superior to prior state-of-the-art quantization methods for Transformers.  ...  Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids  ...  Table IV compares Q8BERT and I-BERT with Mokey.  ... 
doi:10.1145/3470496.3527438 arXiv:2203.12758v1 fatcat:2iowyy5pz5b2xpzqvg6xtaxygy
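To make the "dictionary of fixed-point centroids" idea concrete, here is an illustrative NumPy sketch that builds a 16-entry codebook of 16-bit fixed-point centroids and encodes weights as 4-bit indexes into it. The quantile-based centroid selection and the Q4.12 fixed-point format are assumptions for this sketch, not Mokey's actual fitting procedure.

```python
# Illustrative dictionary quantization: 4-bit indexes into 16-bit fixed-point centroids.
import numpy as np

FRAC_BITS = 12                                         # fixed-point fractional bits

def build_codebook(w: np.ndarray, bits: int = 4) -> np.ndarray:
    levels = 2 ** bits
    qs = (np.arange(levels) + 0.5) / levels            # quantile midpoints
    centroids = np.quantile(w, qs)
    return np.round(centroids * (1 << FRAC_BITS)).astype(np.int16)

def encode(w: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    c = codebook.astype(np.float32) / (1 << FRAC_BITS)
    return np.abs(w[..., None] - c).argmin(axis=-1).astype(np.uint8)  # 4-bit indexes

w = 0.1 * np.random.randn(1024).astype(np.float32)     # stand-in weight tensor
cb = build_codebook(w)
idx = encode(w, cb)
w_hat = cb[idx].astype(np.float32) / (1 << FRAC_BITS)  # decode: look up centroids
print(np.max(np.abs(w_hat - w)))                       # error of the 16-entry dictionary
```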
Showing results 1–15 of 22.