Efficient Design Space Exploration for Sparse Mixed Precision Neural Architectures

Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath, Arun K. Somani
Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '22), 2022
Pruning and Quantization are two effective Deep Neural Network (DNN) compression methods for efficient inference on various hardware platforms. Pruning refers to removing unimportant weights or nodes, whereas Quantization converts floating-point parameters to a low-bit fixed integer representation. Pruned and low-precision models are smaller and faster at inference on hardware platforms while retaining nearly the same accuracy as the unoptimized network. Tensor Cores in the Nvidia Ampere 100 (A100) GPU support (1) 2:4 fine-grained sparse pruning, where 2 out of every 4 elements are pruned, and (2) traditional dense multiplication, to achieve a good accuracy-performance trade-off. The A100 Tensor Core also exploits 1-bit, 4-bit, and 8-bit multiplication to speed up model inference. Hence, finding the right matrix type (dense or 2:4 sparse) along with the precision for each layer becomes a combinatorial problem. Neural Architecture Search (NAS) can alleviate such problems by automating the architecture design process instead of relying on a brute-force search. In this paper, we propose (i) Mixed Sparse and Precision Search (MSPS), a NAS framework to search for an efficient sparse and mixed-precision quantized model within a predefined search space and a fixed backbone neural network (e.g., ResNet50), and (ii) Architecture, Sparse and Precision Search (ASPS), to jointly search for the kernel size, number of filters, and sparse-precision combination of each layer. We illustrate the effectiveness of our methods targeting the A100 Tensor Core on Nvidia GPUs by searching for efficient sparse mixed-precision networks based on ResNet50, achieving better accuracy-latency trade-offs than the manually designed Uniform Sparse Int8 networks.

CCS CONCEPTS • Computing methodologies → Neural networks.
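To make the 2:4 fine-grained sparsity pattern described above concrete, the sketch below zeroes exactly 2 of every 4 consecutive weights in a matrix. This is only an illustrative example under assumed conventions, not the paper's search method or Nvidia's ASP tooling; the magnitude-based choice of which 2 elements to drop and the `prune_2_4` helper name are assumptions for illustration.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Apply a 2:4 fine-grained sparsity pattern along the last dimension.

    In every group of 4 consecutive weights, keep the 2 with the largest
    magnitude and zero the other 2 (magnitude-based selection is an
    assumption made for this sketch).
    """
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the inner dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude elements in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

# Example: an 8x8 weight matrix; every 4-element group retains exactly 2 nonzeros,
# which is the structure the A100 sparse Tensor Cores accelerate.
w = torch.randn(8, 8)
print(prune_2_4(w))
```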
doi:10.1145/3502181.3531463