Bigger&Faster: Two-stage Neural Architecture Search for Quantized Transformer Models

Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko
2022 · arXiv preprint
Neural architecture search (NAS) for transformers has been used to create state-of-the-art models that target certain latency constraints. In this work we present Bigger&Faster, a novel quantization-aware parameter-sharing NAS that finds architectures for 8-bit integer (int8) quantized transformers. Our results show that our method is able to produce BERT models that outperform the current state-of-the-art technique, AutoTinyBERT, at all latency targets we tested, achieving up to a 2.68% accuracy gain. Additionally, although the models found by our technique have more parameters than their float32 counterparts, their parameters are int8, so they have significantly smaller memory footprints.
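As a rough illustration of the memory-footprint claim, the sketch below compares weight storage for a float32 baseline against a larger int8 model. The parameter counts are hypothetical, chosen only to show the arithmetic (1 byte vs. 4 bytes per weight); they are not figures from the paper.

```python
# Sketch: an int8 model with MORE parameters can still use LESS memory
# than a float32 baseline, since each int8 weight needs 1 byte vs. 4.
# Parameter counts below are hypothetical, for illustration only.

BYTES_FP32 = 4  # bytes per float32 weight
BYTES_INT8 = 1  # bytes per int8 weight

baseline_params = 110_000_000  # hypothetical float32 BERT-like model
nas_params = 180_000_000       # hypothetical larger int8 architecture

fp32_footprint = baseline_params * BYTES_FP32  # 440 MB of weights
int8_footprint = nas_params * BYTES_INT8       # 180 MB of weights

print(f"float32 baseline: {fp32_footprint / 1e6:.0f} MB")
print(f"int8 NAS model:   {int8_footprint / 1e6:.0f} MB")
```

Under these assumptions the int8 model stores roughly 64% more parameters in about 40% of the memory, which is the trade-off the abstract describes.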
arXiv:2209.12127v1