Neural Network Quantization with Scale-Adjusted Training
British Machine Vision Conference
Quantization has long been studied as a compression and accelerating technique for deep neural networks due to its potential on reducing model size and computational costs, for both general hardware, such as DSP, CPU or GPU, and customized devices with flexible bit-width configurations, including FPGA and ASIC. However, previous works generally achieve network quantization by sacrificing on prediction accuracy with respect to their full-precision counterparts. In this paper, we investigate the
... nderlying mechanism of such performance degeneration based on previous work of parameterized clipping activation (PACT). We find that the key factor is the weight scale in the last layer. Instead of aligning weight distributions of quantized and full-precision models, as generally suggested in the literature, the main issue is that large scale can cause overfitting problem. We propose a technique called scale-adjusted training (SAT) by directly scaling down weights in the last layer to alleviate such over-fitting. With the proposed technique, quantized networks can demonstrate better performance than their full-precision counter-parts, and we achieve state-of-the-art accuracy with consistent improvement over previous quantization methods for light weight models including MobileNet V1/V2 on ImageNet classification. * PACT and LSQ use full pre-activation ResNet, QIL and SAT use vanilla ResNet, and LQNet uses vanilla ResNet without convolution operation in shortcut (type-A shortcut). † PACT, LQNet and QIL use full-precision for the first and last layers, while LSQ and SAT use 8bit for both layers .