A user-level library for fault tolerance on shared memory multicore systems

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
2012 IEEE 15th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)
The ever-decreasing transistor size has made it possible to integrate multiple cores on a single die. On the downside, this has introduced reliability concerns, as smaller transistors are more prone to both transient and permanent faults. However, the abundant extra processing resources of a multicore system can be exploited to provide fault tolerance through redundant execution. We have designed a library for multicore processing that can make a multithreaded user-level application fault tolerant with simple modifications to the code. It uses the spare cores in the system to perform redundant execution for error detection. It also allows recovery through checkpoint/rollback. Our library is portable, since it does not depend on any special hardware. Furthermore, the overhead our library adds to the original application (up to 46% for 4 threads) is lower than that of other existing approaches, such as Respec.
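To make the combination of redundant execution and checkpoint/rollback concrete, here is a minimal sketch of the general idea. It is not the authors' library or its API: the names `run_redundant`, `accumulate`, `flip_low_bit`, and the `fault_at`/`corrupt` fault-injection parameters are all hypothetical and exist only for illustration. Each epoch is executed by two replicas on private copies of the last checkpoint; matching results are committed as the new checkpoint, and a mismatch triggers a rollback and re-execution.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_redundant(step, state, n_epochs, fault_at=None, corrupt=None):
    """Run `step` twice per epoch, compare outputs, roll back on mismatch.

    `fault_at` and `corrupt` are hypothetical fault-injection hooks: at epoch
    `fault_at`, the follower's result is passed through `corrupt` exactly once
    to simulate a transient fault. Returns (final_state, rollback_count).
    """
    checkpoint = copy.deepcopy(state)  # last known-good state
    rollbacks = 0
    epoch = 0
    pending_fault = fault_at
    with ThreadPoolExecutor(max_workers=2) as pool:
        while epoch < n_epochs:
            # Each replica works on its own private copy of the checkpoint.
            leader = pool.submit(step, copy.deepcopy(checkpoint), epoch)
            follower = pool.submit(step, copy.deepcopy(checkpoint), epoch)
            a, b = leader.result(), follower.result()
            if pending_fault == epoch and corrupt is not None:
                b = corrupt(b)         # simulated transient fault
                pending_fault = None   # transient: it does not recur
            if a != b:
                rollbacks += 1         # error detected: discard both results
                continue               # re-execute this epoch from the checkpoint
            checkpoint = a             # replicas agree: commit a new checkpoint
            epoch += 1
    return checkpoint, rollbacks


def accumulate(state, epoch):
    """Example workload: add the epoch index to a running sum."""
    state["sum"] += epoch
    return state


def flip_low_bit(state):
    """Example corruption: flip the lowest bit of the running sum."""
    bad = copy.deepcopy(state)
    bad["sum"] ^= 1
    return bad


if __name__ == "__main__":
    final, n = run_redundant(accumulate, {"sum": 0}, 5,
                             fault_at=2, corrupt=flip_low_bit)
    print(final, n)
```

The sketch compares whole states after every epoch, which is simple but costly, and it assumes the workload is deterministic so that fault-free replicas always agree. Python threads are used here only to show the structure; they do not provide the true parallel redundancy across cores that the abstract describes.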
doi:10.1109/ddecs.2012.6219071 dblp:conf/ddecs/MushtaqAB12 fatcat:25bj6d6jzne4bhjibvn6vvywcy