The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
[article] · 2022 · arXiv pre-print
Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an […]
arXiv:2203.07259v3