Rethinking binary neural network design for FPGA implementation

Erwei Wang, Peter Cheung
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracy can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates during inference. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inefficiency also exists in BNN training: high-precision gradients and intermediate activations are largely redundant, since only the weights' signs matter.

My PhD focusses on increasing the efficiency of BNN inference and training on FPGAs. For inference, I propose expanding the BNN inference operator to exploit the LUT's full expressiveness. I also identified various redundancies in the standard BNN training method and proposed improvements to reduce them. With the promising improvements in area, energy and memory efficiency demonstrated in my work, my research makes the BNN a more promising architecture for resource-constrained AI deployment.

To make BNNs embrace the full capabilities of the LUT, I propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using native LUTs as inference operators. I demonstrate that exploiting LUT flexibility allows for far heavier pruning than was possible in prior works, resulting in significant area savings while achieving comparable accuracy when implemented on a single fully unrolled layer. Against the state-of-the-art binarised neural network implementation, I achieve twice the area efficiency for several standard network models when performing inference on popular datasets. I also demonstrate that even greater energy efficiency improvements are obtainable. Although implementing just one network layer using the unrolled LUTNet architecture leads to significant area efficiency gains for a given modern [...]
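The two ideas above, that a binarised dot product reduces to XNOR plus a popcount, and that a K-LUT is strictly more expressive than an XNOR gate because it stores an arbitrary 2^K-entry truth table, can be sketched as follows. This is a hypothetical software illustration under my own naming, not code from the thesis:

```python
import numpy as np

# Binarised dot product: values in {-1, +1}, encoded as bits {0, 1}
# (0 encodes -1, 1 encodes +1).
rng = np.random.default_rng(0)
n = 8
w_bits = rng.integers(0, 2, n)  # binary weight encodings
a_bits = rng.integers(0, 2, n)  # binary activation encodings

# Reference result: multiply-accumulate over the {-1, +1} values.
w = 2 * w_bits - 1
a = 2 * a_bits - 1
mac = int(np.dot(w, a))

# XNOR-popcount equivalent: the XNOR of two encodings is 1 exactly when
# the product of the underlying values is +1, so
#   dot = (#agreements) - (#disagreements) = 2 * popcount(XNOR) - n.
xnor = 1 - (w_bits ^ a_bits)
xnor_dot = 2 * int(xnor.sum()) - n
assert mac == xnor_dot

# A K-input LUT is just a 2**K-entry truth table, so it can realise any of
# the 2**(2**K) Boolean functions of its inputs; XNOR is only one of them.
def lut(truth_table, bits):
    idx = 0
    for b in bits:  # pack input bits into a table index
        idx = (idx << 1) | b
    return truth_table[idx]

XNOR2 = [1, 0, 0, 1]  # 2-input XNOR, indexed by (b1, b0)
assert lut(XNOR2, [0, 0]) == 1 and lut(XNOR2, [1, 0]) == 0
```

Replacing the fixed XNOR truth table with a learned one per LUT is, in essence, the extra degree of freedom that LUTNet exposes to training.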
doi:10.25560/97984