Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations
with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07-4.05× for division, and 3.11-3.81× for square root, and use 90.7-94.6% less area than dedicated fused multiply-add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69-7.28× for addition/subtraction, 1.22-2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3-97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11-15.9× and use 38.2-95.3% less area than dedicated FMA hardware.
Index Terms: Arithmetic and logic structures, computer arithmetic, fine-grained system, floating point (FP).