Failure prediction in IBM BlueGene/L event logs
2008
Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)
To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world. ...
In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. ...
To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days. ...
doi:10.1109/ipdps.2008.4536397
dblp:conf/ipps/ZhangS08
fatcat:yhpij5kuxvgb7cksqjg2yag6ka
Failure Prediction in IBM BlueGene/L Event Logs
2007
Seventh IEEE International Conference on Data Mining (ICDM 2007)
To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world. ...
In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. ...
To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days. ...
doi:10.1109/icdm.2007.46
dblp:conf/icdm/LiangZXS07
fatcat:tgbmzaha5vcppln7ddbxq6nqrm
LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems
2012
2012 IEEE 31st Symposium on Reliable Distributed Systems
severity, in logs of large-scale cloud and HPC systems. ...
The experimental results on three logs of production cloud and HPC systems, varying from 433490 entries to 4747963 entries, show that our method can predict failures with a high precision and an acceptable ...
Similar observations are found in the HPC cluster and BlueGene/L logs. ...
doi:10.1109/srds.2012.40
dblp:conf/srds/FuRZZJL12
fatcat:s64itbbeazagdc2kyzar2fyrpm
LogMaster: Mining Event Correlations in Logs of Large scale Cluster Systems
[article]
2013
arXiv
pre-print
in three scenarios: (a) predicting all events on the basis of both failure and non-failure events; (b) predicting only failure events on the basis of both failure and non-failure events; (c) predicting ...
This paper presents a methodology and a system, named LogMaster, for mining correlations of events that have multiple attributions, i.e., node ID, application ID, event type, and event severity, in logs ...
Some work uses a statistical analysis approach to find simple temporal and spatial laws or models of system events [13] [3] [28] in large-scale cluster systems like BlueGene/L. ...
arXiv:1003.0951v2
fatcat:2ver5llwineone5ozmb42spxwa
Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems
[chapter]
2010
Lecture Notes in Computer Science
can predict a diversity of failure events in great detail. ...
In this paper, we propose an online log analysis approach to mine event correlations in system logs of large-scale cluster systems. ...
Some work uses a statistical analysis approach to find simple temporal and spatial laws or models of system events [6] [5] [10] [11] in large-scale cluster systems like BlueGene/L. ...
doi:10.1007/978-3-642-15672-4_23
fatcat:nlemct2yx5d4fme4thjyd7tv4y
BlueGene/L Failure Analysis and Prediction Models
International Conference on Dependable Systems and Networks (DSN'06)
In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. ...
We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. ...
, that this paper attempts to fill using event/failure logs from the BlueGene/L system. ...
doi:10.1109/dsn.2006.18
dblp:conf/dsn/LiangZSJS06
fatcat:rctxd37jkfb6fiauia3vxffssi
Failure analysis, modeling, and prediction for BlueGene/L
2007
increased. This dissertation is based on the Reliability, Availability and Serviceability (RAS) events generated by IBM BlueGene/L over a period of 142 days. ...
Using these logs, we performed failure analysis, modeling, and prediction. ...
To address these issues, in this study, we derive our prediction models from the failure logs collected from IBM BlueGene/L over a period of 142 days. ...
doi:10.7282/t32j6c8n
fatcat:kppejlo7njcbdi5vz6vrz5imti
A System Fault Diagnosis Method with a Reclustering Algorithm
2021
Scientific Programming
The log analysis-based system fault diagnosis method can help engineers analyze the fault events generated by the system. ...
The K-means algorithm can perform log analysis well and does not require a lot of prior knowledge, but the K-means-based system fault diagnosis method needs to be improved in both efficiency and accuracy ...
BlueGene/L is a supercomputer developed by IBM, and it ranked first in the world's TOP500 supercomputer rankings. Thunderbird is located in the Sandia National Laboratory in the United States. The system is ...
doi:10.1155/2021/6617882
fatcat:nliv6smj7bg75afryabrrhlwom
Lossless compression for large scale cluster logs
2006
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
One of the biggest challenges these systems face is managing the system logs generated during deployment in production environments. ...
In this paper we propose a compression algorithm which preprocesses these logs before trying out any standard compression utilities. ...
After careful observation and experimentation, we decided to leverage the following trends in the BlueGene/L logs to design a custom compression algorithm. 1. ...
doi:10.1109/ipdps.2006.1639692
dblp:conf/ipps/BalakrishnanS06
fatcat:ps5rq4o3fnbzvhnrxwti7crj3y
On the use of event logs for the analysis of system failures
2011
The focus of the thesis is to evaluate the accuracy of current logging mechanisms at reporting failures, and to develop novel techniques to make event logs effective to infer failure data. ...
Investigating the suitability of traditional assumptions and techniques underlying log-based failure analysis, in spite of the changes occurred in the computer systems industry, is of paramount importance ...
Prediction methods have been proposed for IBM BlueGene/L [10]. The approach proposed in the paper was able to predict around 80% of memory and network failures and 47% of I/O failures. ...
doi:10.6092/unina/fedoa/8815
fatcat:5tls54mgzvft7bxvmz423i2ohe
Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management
2007
Symposium on Reliable Distributed Systems. Proceedings
We cluster failure events based on their correlations and predict their future occurrences. ...
Moreover, failure events exhibit strong correlations in time and space domain. ...
We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S. ...
doi:10.1109/srds.2007.4365694
fatcat:5awp4kvtoffynjh4j6n4bby3hm
Exploring event correlation for failure prediction in coalitions of clusters
2007
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
Failure events in coalition systems exhibit strong correlations in time and space domain. ...
We cluster failure events based on their correlations and predict their future occurrences. ...
We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the WSU Grid. This research was supported in part by U.S. ...
doi:10.1145/1362622.1362678
dblp:conf/sc/FuX07
fatcat:5v3tla4ngff5vegdj6o4hxcz4e
Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management
2007
2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)
We cluster failure events based on their correlations and predict their future occurrences. ...
Moreover, failure events exhibit strong correlations in time and space domain. ...
We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S. ...
doi:10.1109/srds.2007.18
dblp:conf/srds/FuX07
fatcat:ktwqpaljsfcwpfyu4ynotns634
Learning Towards Failure Prediction of High Performance Computing Clusters by Employing LSTM
2019
International Journal of Engineering and Advanced Technology
Failure prediction of high-performance computing clusters (HPCC) has been a crucial issue and a hot problem for many years. ...
We have employed the concept of long short-term memory (LSTM) with reinforcement learning to correct the prediction accuracy in real-time and provide a solution to the industry with reliable results ...
Past studies on IBM BlueGene/L [23] and LANL [5], [29] exposed significant trends in the root causes of failures in HPCC. ...
doi:10.35940/ijeat.f7885.088619
fatcat:kjgttkraxrelrbvfnjrsg7scnm
Scaling file systems to support petascale clusters: A dependability analysis to support informed design choices
2008
2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN)
Recent literature on failure analysis of BlueGene/L discusses various causes of increased downtime of supercomputers [7]. ...
In this paper, we present a stochastic activity network model that uses failure rates computed from real log data to predict the reliability and availability of the storage architecture of the ABE supercomputer ...
The authors would like to thank the security, storage and other affiliated groups at NCSA for their inputs, log files data and time that enabled us to consolidate this practical report. ...
doi:10.1109/dsn.2008.4630107
dblp:conf/dsn/GaonkarRTS08
fatcat:drbuofi6onewdc3u2fzikl4xxi