nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

Gunho Park, Baeseong Park, Sungjae Lee, Minsub Kim, Byeongwook Kim, Se Jung Kwon, Youngjoo Lee, Dongsoo Lee
2022, arXiv preprint
The recent advance of self-supervised learning, combined with the Transformer architecture, has enabled natural language processing (NLP) models to reach extremely low perplexity. Such powerful models demand ever-increasing model sizes and thus large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. The quantized matrix multiplications are then accelerated by our proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy. nuQmm reduces not only the latency on each GPU but also the end-to-end inference latency of large LMs, because a high compression ratio (via low-bit quantization) lowers the minimum number of GPUs required. Assuming 2-bit quantization, we demonstrate that nuQmm reduces the latency of generating each token for OPT-175B (which requires 8 GPUs without nuQmm) by 47.3% using 8 GPUs, or by 23.2% using only 2 GPUs.
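The abstract names non-uniform weight quantization followed by a specialized low-bit matmul kernel as the core of the framework. Below is a minimal NumPy sketch of one common realization of non-uniform quantization: a per-row lookup-table (codebook) quantizer fitted with 1-D k-means, plus a reference matmul that reconstructs weights from their codes. The function names (quantize_nonuniform, matmul_dequant) are hypothetical illustrations; the paper's actual quantization format and the GPU kernel design of nuQmm are not detailed in this abstract and will differ.

    import numpy as np

    def quantize_nonuniform(W, bits=2, iters=20):
        """Per-row non-uniform quantization via 1-D k-means (Lloyd's algorithm).

        Returns integer codes (same shape as W) and a per-row codebook of
        2**bits levels. Illustrative sketch only, not the paper's scheme.
        """
        n_levels = 2 ** bits
        codes = np.empty(W.shape, dtype=np.int64)
        books = np.empty((W.shape[0], n_levels), dtype=W.dtype)
        for r, row in enumerate(W):
            # Initialize levels at evenly spaced quantiles of the row,
            # so the codebook adapts to the (non-uniform) weight distribution.
            levels = np.quantile(row, np.linspace(0.0, 1.0, n_levels))
            for _ in range(iters):
                # Assign each weight to its nearest level, then recenter levels.
                idx = np.abs(row[:, None] - levels[None, :]).argmin(axis=1)
                for k in range(n_levels):
                    if np.any(idx == k):
                        levels[k] = row[idx == k].mean()
            codes[r], books[r] = idx, levels
        return codes, books

    def matmul_dequant(x, codes, books):
        """Reference y = x @ W_hat.T, reconstructing rows from their codebooks."""
        W_hat = np.take_along_axis(books, codes, axis=1)
        return x @ W_hat.T

    # Usage: quantize a weight matrix to 2 bits and compare against FP matmul.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128)).astype(np.float32)
    x = rng.standard_normal((4, 128)).astype(np.float32)
    codes, books = quantize_nonuniform(W, bits=2)
    err = np.abs(matmul_dequant(x, codes, books) - x @ W.T).mean()

Note that this reference implementation dequantizes the full matrix before multiplying, which gives up the memory savings; a production kernel of the kind the abstract describes would keep the weights as low-bit codes in GPU memory and expand them through lookup tables inside the kernel, so that the reduced memory traffic translates into lower latency.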
arXiv:2206.09557v2