Efficiency and scalability exploration of an application-specific instruction-set processor for deep convolutional neural networks [thesis]

Andreas Bytyn, Gerd Ascheid, Rainer Leupers
2020
With the increasing adoption of artificial neural networks (ANNs) for tasks such as image classification and object detection, the need for efficient acceleration of these workloads has grown correspondingly. Convolutional neural networks (CNNs) in particular have reached both the mainstream consumer market, as evidenced, e.g., by the CNN-based image-enhancement filters found in modern smartphone cameras, and more traditional branches such as the automotive industry, which aims to use CNNs for pedestrian detection, traffic sign recognition, and similar tasks. Currently available off-the-shelf processing systems, however, do not offer both the flexibility and the energy efficiency required in energy- and power-constrained environments.

This thesis provides methods and a proof-of-concept hardware implementation that help bridge this efficiency-flexibility gap by jointly investigating algorithm, software, and hardware optimizations. Several algorithmic techniques that increase overall processing efficiency are quantitatively investigated, including different methods for the quantization of computations and for the compression of neural network parameters (pruning). To ensure that the overall results are representative, several CNNs are explored, using regular convolutional layers as well as more advanced residual blocks and depthwise-separable convolutions. Based on this in-depth analysis, an application-specific instruction-set processor (ASIP) was designed from scratch that is capable of exploiting several of these efficiency-enhancing techniques. Its arithmetic units were designed on the basis of the aforementioned quantization analysis to minimize their dynamic power consumption, employing subword-parallel multipliers with reduced word width that are also capable of exploiting the sparsity found in CNN computations. Many application-specific sub-modules were incorporated to ensure a high degree of utilization regarding the arithm [...]
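To make the quantization idea concrete, the sketch below shows a generic symmetric fixed-point quantization of a weight tensor to a reduced word width. This is an illustrative example only, not the specific scheme evaluated in the thesis; the function names and the per-tensor scaling choice are assumptions.

```python
import numpy as np

def quantize_symmetric(weights, n_bits=8):
    """Map a float tensor to signed n-bit integers (symmetric, per-tensor).

    Illustrative sketch: the thesis compares several quantization methods;
    this generic scheme merely shows the principle of trading word width
    for dynamic power in the arithmetic units.
    """
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.max(np.abs(weights)) / qmax  # largest magnitude maps to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_symmetric(w, n_bits=8)
w_hat = dequantize(q, s)  # matches w to within one quantization step
```

Narrower multipliers driven by such low-word-width operands are what make subword-parallel arithmetic attractive: several reduced-width multiplications can share one wide datapath.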
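Similarly, the parameter compression (pruning) and sparsity exploitation mentioned above can be sketched in a few lines. Both functions below are hypothetical illustrations under the assumption of simple magnitude pruning and operand-gated multiply-accumulate; the thesis's exact methods may differ.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights.

    Generic magnitude-pruning sketch, not the thesis's specific
    compression scheme.
    """
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= thresh, 0.0, weights)

def sparse_dot(x, w):
    """Dot product that skips zero weights — a software analogue of the
    zero-skipping a sparsity-aware MAC unit performs in hardware."""
    acc = 0.0
    for xi, wi in zip(x, w):
        if wi != 0.0:          # gate the multiplier on zero operands
            acc += xi * wi
    return acc

w = magnitude_prune(np.array([0.1, -0.05, 2.0, -3.0]), sparsity=0.5)
y = sparse_dot([1.0, 1.0, 1.0, 1.0], w)  # only two multiplications run
```

In hardware, the same idea saves energy by keeping the multiplier idle whenever one operand is zero, which pruning makes frequent.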
doi:10.18154/rwth-2021-02253 fatcat:s4ohrfpl2zevvgbqzb36eulcca