Assessment and Improvement of Hang Detection in the Linux Operating System

Domenico Cotroneo, Roberto Natella, Stefano Russo
2009 28th IEEE International Symposium on Reliable Distributed Systems
We propose a fault injection framework to assess hang detection facilities within the Linux Operating System (OS). The novelty of the framework consists in the adoption of a more representative faultload than existing ones, and in the effectiveness in terms of number of hang failures produced; representativeness is supported by a field data study on the Linux OS. Using the proposed fault injection framework, along with realistic workloads, we find that the Linux OS is unable to detect hangs in several cases. We experience a relative coverage of 75%. To improve detection facilities, we propose a simple yet effective hang detector, which periodically tests OS liveness, as perceived by applications, by means of I/O system calls; it is shown that this approach can improve relative coverage up to 94%. The hang detector can be deployed on any Linux system, with an acceptable overhead.
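
The idea of testing OS liveness from the application side can be illustrated with a small user-space sketch. This is not the authors' detector: the function names `io_probe`, `os_is_live`, and `monitor`, the choice of system calls, and the timeout and period values are all assumptions made for illustration. The sketch runs a few I/O system calls in a worker thread and reports a suspected hang if they do not complete within a deadline.

```python
import os
import tempfile
import threading
import time


def io_probe(directory):
    """Exercise a few I/O system calls (open, write, fsync, read, unlink)."""
    # Hypothetical probe file name; any writable directory would do.
    path = os.path.join(directory, "liveness_probe_%d.tmp" % os.getpid())
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"ping")
        os.fsync(fd)                      # force the write down to the device
        os.lseek(fd, 0, os.SEEK_SET)
        if os.read(fd, 4) != b"ping":
            raise OSError("probe read back unexpected data")
    finally:
        os.close(fd)
        os.unlink(path)


def os_is_live(probe, timeout_s):
    """Return True if probe() completes within timeout_s seconds, else False."""
    done = threading.Event()

    def run():
        try:
            probe()
            done.set()
        except OSError:
            pass  # a failing probe never signals, so it counts as not live

    # Daemon thread: a probe stuck in a system call must not block exit.
    threading.Thread(target=run, daemon=True).start()
    return done.wait(timeout_s)


def monitor(directory, period_s=1.0, timeout_s=5.0, rounds=3):
    """Run the probe periodically; return the index of the first missed round."""
    for i in range(rounds):
        if not os_is_live(lambda: io_probe(directory), timeout_s):
            return i                      # suspected hang at round i
        time.sleep(period_s)
    return None                           # every probe completed in time


if __name__ == "__main__":
    print(monitor(tempfile.gettempdir(), period_s=0.1, rounds=3))
```

A sketch like this only covers hangs that leave the monitoring process schedulable. If the detector itself is stalled it cannot raise an alarm, so a practical deployment would need some external check as well.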
doi:10.1109/srds.2009.26 dblp:conf/srds/CotroneoNR09 fatcat:4imj6vrvyvcfpjeik4obfc2w3e