Optimal real number codes for fault tolerant matrix operations

Zizhong Chen
2009 Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09  
It has been demonstrated recently that a single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can also be tolerated without checkpointing by encoding matrices with a real-number erasure-correcting code. However, the floating-point representation of real numbers in today's high-performance computer architectures introduces round-off errors, which can be amplified and cause the loss of precision of possibly ... effective digits during recovery when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon style real-number erasure-correcting codes that have optimal numerical stability during recovery. We analytically construct the numerically best erasure-correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
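
To make the idea of a real-number erasure code concrete, here is a minimal sketch of encoding a data vector with two weighted checksums and recovering two erased entries. It is not the code constructed in the paper: the generator rows (all ones, and 1, 2, ..., n) and the helper names `encode` and `recover_two` are assumptions chosen only for illustration. Recovery solves a 2x2 linear system, and the conditioning of that system is what governs how much round-off error is amplified.

```python
def encode(data, gen):
    """Return one weighted checksum per generator row."""
    return [sum(g * x for g, x in zip(row, data)) for row in gen]


def recover_two(data, checksums, gen, i, j):
    """Recover the erased entries at positions i and j.

    The values stored in data[i] and data[j] are ignored. `gen` must
    have exactly two rows (illustrative assumption: two checksums).
    """
    # Remove the contribution of the surviving entries from each checksum.
    rhs = []
    for row, c in zip(gen, checksums):
        known = sum(g * x for k, (g, x) in enumerate(zip(row, data))
                    if k not in (i, j))
        rhs.append(c - known)

    # Solve the 2x2 system [[a, b], [c, d]] @ [x_i, x_j] = rhs by Cramer's rule.
    a, b = gen[0][i], gen[0][j]
    c, d = gen[1][i], gen[1][j]
    det = a * d - b * c
    if det == 0:
        raise ValueError("generator submatrix is singular; cannot recover")
    x_i = (rhs[0] * d - b * rhs[1]) / det
    x_j = (a * rhs[1] - c * rhs[0]) / det
    return x_i, x_j


# Hypothetical example data and generator rows.
data = [1.5, -2.0, 3.25, 0.5]
gen = [[1.0, 1.0, 1.0, 1.0],
       [1.0, 2.0, 3.0, 4.0]]
checksums = encode(data, gen)

# Simulate losing entries 1 and 3, then rebuild them from the checksums.
damaged = [1.5, None, 3.25, None]
print(recover_two(damaged, checksums, gen, 1, 3))  # (-2.0, 0.5)
```

In this sketch the determinant of the 2x2 submatrix depends on which two positions are lost, so some erasure patterns are recovered less accurately than others in floating point. Choosing generator rows that keep every such submatrix well conditioned is the kind of problem the abstract describes.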
doi:10.1145/1654059.1654089 dblp:conf/sc/Chen09 fatcat:ud4ruwqgkvbkxnvkncahmcphem