Filtering Failure Logs for a BlueGene/L Prototype

Yinglung Liang, Yanyong Zhang, A. Sivasubramaniam, R.K. Sahoo, J. Moreira, M. Gupta
2005 International Conference on Dependable Systems and Networks (DSN'05)  
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
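The three filtering steps can be illustrated with a minimal sketch. It is not the authors' implementation: the `Event` fields, the severity labels, the location strings, and the 300-second thresholds are all assumptions made for illustration, since the abstract specifies none of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary epoch (assumed unit)
    location: str     # hypothetical location identifier
    category: str     # hypothetical failure category assigned in step 1
    severity: str     # hypothetical severity label


def extract_failures(events, fatal_levels=("FATAL", "FAILURE")):
    """Step 1: keep only failure events, sorted by time."""
    failures = [e for e in events if e.severity in fatal_levels]
    return sorted(failures, key=lambda e: e.timestamp)


def temporal_filter(events, threshold):
    """Step 2: drop repeated reports of the same category from the same
    location that arrive within `threshold` seconds of the previous report."""
    last_seen = {}
    kept = []
    for e in events:
        key = (e.location, e.category)
        prev = last_seen.get(key)
        if prev is None or e.timestamp - prev > threshold:
            kept.append(e)
        # Update on every report, so a long burst collapses to one entry.
        last_seen[key] = e.timestamp
    return kept


def spatial_filter(events, threshold):
    """Step 3: coalesce reports of the same category from different
    locations that arrive within `threshold` seconds of each other."""
    last_seen = {}
    kept = []
    for e in events:
        prev = last_seen.get(e.category)
        if prev is None or e.timestamp - prev > threshold:
            kept.append(e)
        last_seen[e.category] = e.timestamp
    return kept


def filter_log(events, t_threshold=300.0, s_threshold=300.0):
    """Run the three steps in order; both thresholds are assumed values."""
    failures = extract_failures(events)
    return spatial_filter(temporal_filter(failures, t_threshold), s_threshold)


sample = [
    Event(0.0, "R00-M0", "memory", "FATAL"),
    Event(10.0, "R00-M0", "memory", "FATAL"),    # same location, removed in step 2
    Event(20.0, "R01-M1", "memory", "FATAL"),    # other location, removed in step 3
    Event(30.0, "R00-M0", "network", "INFO"),    # not a failure, removed in step 1
    Event(1000.0, "R00-M0", "memory", "FATAL"),  # outside both windows, kept
]
filtered = filter_log(sample)
```

In this sketch each window slides with every new report, so a continuous burst is reduced to its first entry. A fixed window measured from the first report of a burst would be an equally plausible reading of the abstract.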
doi:10.1109/dsn.2005.50 dblp:conf/dsn/LiangZSSMG05 fatcat:hribyyz6pnh57dhgpoqiewpmcu