Efficient Design Space Exploration for Sparse Mixed Precision Neural Architectures
2022
Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Pruning and Quantization are two effective Deep Neural Network (DNN) compression methods for efficient inference on various hardware platforms. Pruning removes unimportant weights or nodes, whereas Quantization converts floating-point parameters to a low-bit fixed-point integer representation. The pruned, low-precision models are smaller and faster at inference, with almost the same accuracy as the unoptimized network. Tensor Cores in Nvidia Ampere 100 …
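As a rough illustration of the two techniques described in the abstract (not code from the paper), the sketch below applies simple magnitude pruning followed by symmetric 8-bit quantization to a weight matrix using NumPy; the sparsity target, bit width, and function names are assumptions chosen for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value; weights at or below it are removed.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def quantize_int8(weights: np.ndarray):
    """Symmetric uniform quantization of float weights to int8 (illustrative)."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize later with q * scale

# Example: prune 50% of a random weight matrix, then quantize the surviving weights.
w = np.random.randn(4, 4).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)
w_q, scale = quantize_int8(w_sparse)
print("sparsity:", np.mean(w_sparse == 0.0))
print("max dequantization error:", np.abs(w_q * scale - w_sparse).max())
```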
doi:10.1145/3502181.3531463
fatcat:2xdjyoyhjfgozdtqpcz4cbmfqm