Adaptive and Power-Aware Resilience for Extreme-Scale Computing

Xiaolong Cui, Taieb Znati, Rami Melhem
2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld)  
With concerted efforts from researchers in hardware, software, algorithms, and resource management, HPC is moving towards extreme scale, featuring a computing capability of exaFLOPS. As we approach the new era of computing, however, several daunting scalability challenges remain to be conquered. Delivering extreme-scale performance will require a computing platform that supports billion-way parallelism, necessitating a dramatic increase in the number of computing, storage, and networking ... ts. At such a large scale, failure would become the norm rather than the exception, driving the system to significantly lower efficiency with an unprecedented amount of power consumption. To tackle these challenges, we propose an adaptive and power-aware algorithm, referred to as Lazy Shadowing, as an efficient and scalable approach to achieve high levels of resilience, through forward progress, in extreme-scale, failure-prone computing environments. Lazy Shadowing associates with each process a "shadow" (process) that executes at a reduced rate, and opportunistically rolls forward each shadow to catch up with its leading process during failure recovery. Compared to existing fault tolerance methods, our approach can achieve 20% energy saving with potential reduction in solution time at scale.
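The shadowing idea in the abstract can be illustrated with a toy timing model. The sketch below is not the paper's model: it assumes a single main/shadow pair, at most one failure, a main process running at rate 1, and a shadow that switches to full rate once the main fails. The function names and parameters (`completion_time`, `restart_time`, `shadow_rate`, `failure_time`) are hypothetical and chosen only for illustration.

```python
def completion_time(work, shadow_rate, failure_time=None):
    """Time to finish `work` units with a shadow running at `shadow_rate` (0..1).

    Assumption: the main runs at rate 1; if it fails at `failure_time`,
    the shadow takes over at rate 1 from wherever it has reached.
    """
    if not 0.0 <= shadow_rate <= 1.0:
        raise ValueError("shadow_rate must be between 0 and 1")
    if failure_time is None or failure_time >= work:
        return float(work)  # main finishes before any failure
    shadow_done = shadow_rate * failure_time  # progress the shadow already made
    return failure_time + (work - shadow_done)


def restart_time(work, failure_time=None):
    """Baseline for comparison: restart from scratch after a failure."""
    if failure_time is None or failure_time >= work:
        return float(work)
    return float(failure_time + work)


if __name__ == "__main__":
    # Failure at t=40 on a 100-unit task, shadow at half speed.
    print(completion_time(100, 0.5, failure_time=40))
    print(restart_time(100, failure_time=40))
```

In this simplified setting, a slower shadow consumes less power while the main is healthy but leaves more work to redo after a failure, which is the trade-off an adaptive, power-aware scheme would have to balance.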
doi:10.1109/uic-atc-scalcom-cbdcom-iop-smartworld.2016.0111 dblp:conf/uic/CuiZM16 fatcat:qw3ld2anjvg7dnbbfbu3ydmmta