Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs

Jon J. Pimentel, Brent Bohnenstiehl, Bevan M. Baas
2017 IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07-4.05× for division, and 3.11-3.81× for square root, and use 90.7-94.6% less area than dedicated fused multiply-add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69-7.28× for addition/subtraction, 1.22-2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3-97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11-15.9× and use 38.2-95.3% less area than dedicated FMA hardware.
Index Terms: Arithmetic and logic structures, computer arithmetic, fine-grained system, floating point (FP).
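As a rough illustration of why leading zero detection can help software FP on a fixed-point core, the sketch below normalizes a significand two ways: by shifting one bit at a time, and by counting leading zeros once and shifting once. It is not the paper's instruction set or kernel; the 24-bit width, the function names, and the step counts are assumptions made only for this example, and the input significand is assumed to fit in 24 bits.

```python
MANT_BITS = 24  # assumed significand width (single precision incl. hidden bit)


def normalize_loop(mant, exp):
    """Normalize by shifting left one bit per iteration (no leading-zero support).

    Returns (mantissa, exponent, number_of_shift_steps).
    """
    steps = 0
    if mant == 0:
        return 0, 0, steps
    while mant < (1 << (MANT_BITS - 1)):
        mant <<= 1
        exp -= 1
        steps += 1
    return mant, exp, steps


def normalize_lzd(mant, exp):
    """Normalize with one leading-zero count followed by a single shift.

    int.bit_length() stands in for a hypothetical leading-zero-detect instruction.
    Returns (mantissa, exponent, number_of_shift_steps).
    """
    if mant == 0:
        return 0, 0, 0
    lz = MANT_BITS - mant.bit_length()  # leading zeros within the 24-bit field
    return mant << lz, exp - lz, 1


if __name__ == "__main__":
    # Same normalized result either way; only the number of shift steps differs.
    print(normalize_loop(0x000123, 10))
    print(normalize_lzd(0x000123, 10))
```

Both functions return the same normalized significand and exponent; the loop version needs one step per leading zero, while the leading-zero version needs a single shift regardless of the input.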
doi:10.1109/tvlsi.2016.2580142 fatcat:zpc5nsuaobbhrhamz7ttyilipq