
Failure prediction in IBM BlueGene/L event logs

Yanyong Zhang, Anand Sivasubramaniam
2008 Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)  
To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world.  ...  In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events.  ...  To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days.  ... 
doi:10.1109/ipdps.2008.4536397 dblp:conf/ipps/ZhangS08 fatcat:yhpij5kuxvgb7cksqjg2yag6ka

Failure Prediction in IBM BlueGene/L Event Logs

Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra Sahoo
2007 Seventh IEEE International Conference on Data Mining (ICDM 2007)  
To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world.  ...  In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events.  ...  To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days.  ... 
doi:10.1109/icdm.2007.46 dblp:conf/icdm/LiangZXS07 fatcat:tgbmzaha5vcppln7ddbxq6nqrm

LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems

Xiaoyu Fu, Rui Ren, Jianfeng Zhan, Wei Zhou, Zhen Jia, Gang Lu
2012 2012 IEEE 31st Symposium on Reliable Distributed Systems  
severity, in logs of large-scale cloud and HPC systems.  ...  The experimental results on three logs of production cloud and HPC systems, varying from 433490 entries to 4747963 entries, show that our method can predict failures with a high precision and an acceptable  ...  Similar observations are found in the HPC cluster and BlueGene/L logs.  ... 
doi:10.1109/srds.2012.40 dblp:conf/srds/FuRZZJL12 fatcat:s64itbbeazagdc2kyzar2fyrpm

LogMaster: Mining Event Correlations in Logs of Large scale Cluster Systems [article]

Rui Ren, Xiaoyu Fu, Jianfeng Zhan, Wei Zhou
2013 arXiv   pre-print
in three scenarios: (a) predicting all events on the basis of both failure and non-failure events; (b) predicting only failure events on the basis of both failure and non-failure events; (c) predicting  ...  This paper presents a methodology and a system, named LogMaster, for mining correlations of events that have multiple attributions, i.e., node ID, application ID, event type, and event severity, in logs  ...  Some work uses statistical analysis approach to find simple temporal and spatial laws or models of system events [13] [3] [28] in large-scale cluster systems like Bluegene/L.  ... 
arXiv:1003.0951v2 fatcat:2ver5llwineone5ozmb42spxwa

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems [chapter]

Wei Zhou, Jianfeng Zhan, Dan Meng, Zhihong Zhang
2010 Lecture Notes in Computer Science  
can predict diversities of failure events with the great detail.  ...  In this paper, we propose an online log analysis approach to mine event correlations in system logs of large-scale cluster systems.  ...  Some work uses statistical analysis approach to find simple temporal and spatial laws or models of system events [6] [5] [10] [11] in large-scale cluster systems like BlueGene/L.  ... 
doi:10.1007/978-3-642-15672-4_23 fatcat:nlemct2yx5d4fme4thjyd7tv4y

BlueGene/L Failure Analysis and Prediction Models

Yinglung Liang, Yanyong Zhang, A. Sivasubramaniam, M. Jette, R. Sahoo
International Conference on Dependable Systems and Networks (DSN'06)  
In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days.  ...  We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events.  ...  , that this paper attempts to fill using event/failure logs from the BlueGene/L system.  ... 
doi:10.1109/dsn.2006.18 dblp:conf/dsn/LiangZSJS06 fatcat:rctxd37jkfb6fiauia3vxffssi

Failure analysis, modeling, and prediction for BlueGene/L

Yinglung Liang
2007
increased. This dissertation is based on the Reliability, Availability and Serviceability (RAS) events generated by IBM BlueGene/L over a period of 142 days.  ...  Using these logs, we performed failure analysis, modeling, and prediction.  ...  To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days.  ... 
doi:10.7282/t32j6c8n fatcat:kppejlo7njcbdi5vz6vrz5imti

A System Fault Diagnosis Method with a Reclustering Algorithm

Zhe Yang, Shi Ying, Bingming Wang, Yiyao Li, Bo Dong, Jiangyi Geng, Ting Zhang, Pengwei Wang
2021 Scientific Programming  
The log analysis-based system fault diagnosis method can help engineers analyze the fault events generated by the system.  ...  The K-means algorithm can perform log analysis well and does not require a lot of prior knowledge, but the K-means-based system fault diagnosis method needs to be improved in both efficiency and accuracy  ...  BlueGene/L is a supercomputer developed by IBM, and it ranked first in the world's TOP500 supercomputer rankings. Thunderbird is located in the Sandia National Laboratory in the United States. The system is  ... 
doi:10.1155/2021/6617882 fatcat:nliv6smj7bg75afryabrrhlwom

Lossless compression for large scale cluster logs

R. Balakrishnan, R.K. Sahoo
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
One of the biggest challenges these systems face is to manage generated system logs while deploying in production environments.  ...  In this paper we propose a compression algorithm which preprocesses these logs before trying out any standard compression utilities.  ...  After careful observation and experimentation, we decided to leverage the following trends in the Bluegene/L logs to design a custom compression algorithm. 1.  ... 
doi:10.1109/ipdps.2006.1639692 dblp:conf/ipps/BalakrishnanS06 fatcat:ps5rq4o3fnbzvhnrxwti7crj3y

On the use of event logs for the analysis of system failures

Antonio Pecchia
2011
The focus of the thesis is to evaluate the accuracy of current logging mechanisms at reporting failures, and to develop novel techniques to make event logs effective to infer failure data.  ...  Investigating the suitability of traditional assumptions and techniques underlying log-based failure analysis, in spite of the changes occurred in the computer systems industry, is of paramount importance  ...  Prediction methods have been proposed for IBM BlueGene/L [10] . The approach proposed in the paper was able to predict around 80% of memory and network failures and 47% of I/O failures.  ... 
doi:10.6092/unina/fedoa/8815 fatcat:5tls54mgzvft7bxvmz423i2ohe

Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management

Song Fu, Cheng-Zhong Xu
2007 Symposium on Reliable Distributed Systems. Proceedings  
We cluster failure events based on their correlations and predict their future occurrences.  ...  Moreover, failure events exhibit strong correlations in time and space domain.  ...  We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S.  ... 
doi:10.1109/srds.2007.4365694 fatcat:5awp4kvtoffynjh4j6n4bby3hm

Exploring event correlation for failure prediction in coalitions of clusters

Song Fu, Cheng-Zhong Xu
2007 Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07  
Failure events in coalition systems exhibit strong correlations in time and space domain.  ...  We cluster failure events based on their correlations and predict their future occurrences.  ...  We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the WSU Grid. This research was supported in part by U.S.  ... 
doi:10.1145/1362622.1362678 dblp:conf/sc/FuX07 fatcat:5v3tla4ngff5vegdj6o4hxcz4e

Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management

Song Fu, Cheng-Zhong Xu
2007 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)  
We cluster failure events based on their correlations and predict their future occurrences.  ...  Moreover, failure events exhibit strong correlations in time and space domain.  ...  We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S.  ... 
doi:10.1109/srds.2007.18 dblp:conf/srds/FuX07 fatcat:ktwqpaljsfcwpfyu4ynotns634

Learning Towards Failure Prediction of High Performance Computing Clusters by Employing LSTM

2019 International Journal of Engineering and Advanced Technology  
Failure prediction of high-performance computing clusters (HPCC) has been a crucial issue and a hot problem for many years.  ...  We have employed the concept of long short-term memory (LSTM) with reinforcement learning to correct the prediction accuracy in real-time and provide a solution to the industry with reliable results  ...  Past studies on IBM BlueGene/L [23] and LANL [5] , [29] , exposed significant trends for the root cause of failures in HPCC.  ... 
doi:10.35940/ijeat.f7885.088619 fatcat:kjgttkraxrelrbvfnjrsg7scnm

Scaling file systems to support petascale clusters: A dependability analysis to support informed design choices

Shravan Gaonkar, Eric Rozier, Anthony Tong, William H. Sanders
2008 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN)  
Recent literature on failure analysis of BlueGene/L discusses various causes of increased downtime of supercomputers [7] .  ...  In this paper, we present a stochastic activity network model that uses failure rates computed from real log data to predict the reliability and availability of the storage architecture of the ABE supercomputer  ...  The authors would like to thank the security, storage and other affiliated groups at NCSA for their inputs, log files data and time that enabled us to consolidate this practical report.  ... 
doi:10.1109/dsn.2008.4630107 dblp:conf/dsn/GaonkarRTS08 fatcat:drbuofi6onewdc3u2fzikl4xxi