Failure Diagnosis for Cluster Systems using Partial Correlations

Edward Chuah, Arshad Jhumka, Samantha Alt, R. Todd Evans, Neeraj Suri
2021 Zenodo  
The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors.  ...  As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft advocated basis for failure diagnosis.  ...  to ascertain partial correlations for failure diagnosis. • We diagnose system failures without prior knowledge of a fault model.  ... 
doi:10.5281/zenodo.5509414 fatcat:7w4hzzpt4jcwtpby25jyun5cse

Multidimensional Analysis of System Logs in Large-scale Cluster Systems [article]

Wei Zhou, Jianfeng Zhan, Dan Meng
2009 arXiv   pre-print
It is effective to improve the reliability and availability of large-scale cluster systems through the analysis of failures.  ...  Existing failure analysis methods examine failures along only one or a few dimensions. The analysis results are partial and less precise because of the limited data sources.  ...  The research includes these key points:  ...  Correlating instrumentation data to system states: A building block for automated diagnosis and control.  ... 
arXiv:0906.1328v1 fatcat:uot5au7kerccvcstdffrcrcqf4

Research on Fault Diagnosis of Launch Vehicle's Power Transformation and Transmission System Based on Big Data

Yichi Zhang, Tao Shu, Xincheng Song, Yan Xu, Pengxiang Zhang, Jie Chen
2021 Mathematical Problems in Engineering  
On the basis of big data, this paper introduces the failure mode clustering algorithm, the state parameter correlation analysis algorithm, the fault diagnosis method based on the correlation matrix, and  ...  The on-board power supply system provides power for the launch vehicle.  ...  In the past 10 years, live detection and online monitoring systems for primary equipment have been widely used in China.  ... 
doi:10.1155/2021/3108000 fatcat:5hldc6q4qfej7bjayxxotm2hqm

Inter-Process Correlation Model based Hybrid Framework for Fault Diagnosis in Wireless Sensor Networks

2019 KSII Transactions on Internet and Information Systems  
The proposed model is realized through local and global decision trees for fault diagnosis.  ...  Simulation results validate the inter-process correlation model-based fault diagnosis.  ...  In this case, a failure report containing partial diagnosis results is generated and sent to DCH for detailed analysis.  ... 
doi:10.3837/tiis.2019.02.004 fatcat:52fi7quoanbe3om2un77mrsvj4

Guided Problem Diagnosis through Active Learning

Songyun Duan, Shivnath Babu
2008 2008 International Conference on Autonomic Computing  
We report an experimental evaluation of our algorithm using data from a variety of failures, both single failures and multiple correlated failures, injected in a testbed, as well as with synthetic data.  ...  Previous work on wholly or partially automated diagnosis focused on L or U in isolation.  ...  Related Work There has been plenty of previous work on wholly or partially automated techniques for diagnosing performance and availability problems in systems.  ... 
doi:10.1109/icac.2008.28 dblp:conf/icac/DuanB08 fatcat:uuhfggcnqbckdnlxxmja7hjfja

Diagnostic Agent Based Inter-Process Communication Aware Monitoring System for Wireless Sensor Networks

Amna Zafar, Ali Hammad Akber
2019 Mehran University Research Journal of Engineering and Technology  
Therefore, efficient and effective monitoring systems for fault detection and diagnosis are imperative for fault tolerance and robust operation of WSN to meet critical application requirements for reliability  ...  Local diagnostic agent is implemented on sensor nodes for self-monitoring and network wide fault diagnosis is performed by global diagnostic agent on cluster head.  ...  Agent) performs fault diagnosis within a cluster.  ... 
doi:10.22581/muet1982.1902.07 fatcat:utx4ktb5dnflfc6cg6ezmdbrbi

Agnostic Diagnosis: Discovering Silent Failures in Wireless Sensor Networks

Xin Miao, Kebin Liu, Yuan He, Dimitris Papadias, Qiang Ma, Yunhao Liu
2013 IEEE Transactions on Wireless Communications  
Currently, there is no effective solution for silent failures because they are often diverse and highly system-related.  ...  On the other hand, our experience with GreenOrbs, a long-term large-scale WSN system, reveals the need of diagnosis in an agnostic manner.  ...  AD explores the correlation between system metrics using a two-stage cross validation scheme to detect silent failures.  ... 
doi:10.1109/twc.2013.110813.121812 fatcat:qlwekztk2rasjeqqajza7ap4ju

Agnostic diagnosis: Discovering silent failures in wireless sensor networks

Xin Miao, Kebin Liu, Yuan He, Yunhao Liu, Dimitris Papadias
2011 2011 Proceedings IEEE INFOCOM  
Currently, there is no effective solution for silent failures because they are often diverse and highly system-related.  ...  On the other hand, our experience with GreenOrbs, a long-term large-scale WSN system, reveals the need of diagnosis in an agnostic manner.  ...  AD explores the correlation between system metrics using a two-stage cross validation scheme to detect silent failures.  ... 
doi:10.1109/infcom.2011.5934945 dblp:conf/infocom/MiaoLHLP11 fatcat:av6iu3mtfzftzhlw5nhierksfy

An Autonomic Cycle of Data Analysis Tasks for the Supervision of HVAC Systems of Smart Building

Jose Aguilar, Douglas Ardila, Andrés Avendaño, Felipe Macias, Camila White, José Gomez-Pulido, José Gutierrez de Mesa, Alberto Garces-Jimenez
2020 Energies  
Data models for fault detection and diagnosis are increasingly used for extracting knowledge in the supervisory tasks.  ...  Early fault detection and diagnosis in heating, ventilation and air conditioning (HVAC) systems may reduce the damage of equipment, improving the reliability and safety of smart buildings, generating social  ...  Task 3: Diagnosis of Failures  ...  Table 9. Silhouette Coefficient obtained for different number of clusters applied to one single HVAC subsystem.  ... 
doi:10.3390/en13123103 fatcat:33qlg2ppkrarpe75ifgrkridem

Failure Diagnosis of Complex Systems [chapter]

Soila P. Kavulya, Kaustubh Joshi, Felicita Di Giandomenico, Priya Narasimhan
2012 Resilience Assessment and Evaluation of Computing Systems  
The results of diagnosis also provide data about a system's operational fault profile for use in offline resilience evaluation.  ...  Failure diagnosis is the process of identifying the causes of impairment in a system's function based on observable symptoms, i.e., determining which fault led to an observed failure.  ...  Using the fault, error, and failure nomenclature of [53], failure diagnosis is the process of identifying the fault that has led to an observed failure of a system or its constituent components.  ... 
doi:10.1007/978-3-642-29032-9_12 fatcat:dyxufulyhfgpfbjruwizaj7eia

Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis

Edward Chuah, Arshad Jhumka, Samantha Alt, Theo Damoulas, Nentawe Gurumdimma, Marie-Christine Sawley, William L. Barth, Tommy Minyard, James C. Browne
2017 2017 IEEE 24th International Conference on High Performance Computing (HiPC)  
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data and granting access to their systems administrators.  ...  We also thank Karl Solchenbach (Intel Corporation, Europe) for granting access to his research scientists.  ... 
doi:10.1109/hipc.2017.00044 dblp:conf/hipc/ChuahJADGSBMB17 fatcat:6cuzyr5vsvcn7irb76o6vrv2nu

Using Message Logs and Resource Use Data for Cluster Failure Diagnosis

Edward Chuah, Arshad Jhumka, James C. Browne, Nentawe Gurumdimma, Sai Narasimhamurthy, Bill Barth
2016 2016 IEEE 23rd International Conference on High Performance Computing (HiPC)  
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data.  ...  Zhou Changjiu and Singapore Polytechnic senior management for allowing the principal author to complete this work.  ... 
doi:10.1109/hipc.2016.035 dblp:conf/hipc/ChuahJBGNB16 fatcat:cs7oj54a6jhrtlomrxpniqsh6m

Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers

Siavash Ghiasvand, Florina M. Ciorba, Ronny Tschuter, Wolfgang E. Nagel
2016 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)  
The results of this study are aimed at helping the system administrators minimize (or even prevent) the destructive effects of correlated node failures.  ...  We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures.  ...  The authors also thank Holger Mickler of Technische Universität Dresden for his support in collecting the monitoring information on the high performance computing system.  ... 
doi:10.1109/pdp.2016.101 dblp:conf/pdp/GhiasvandCTN16 fatcat:ycvrjtcxwbaa7ckoomqndqqppi

Fa: A System for Automating Failure Diagnosis

Songyun Duan, Shivnath Babu, Kamesh Munagala
2009 Proceedings / International Conference on Data Engineering  
Fa uses a new technique called anomaly-based clustering when the signature database has no high-confidence match for an undiagnosed failure.  ...  This paper identifies two key data-mining problems arising in a platform for automated diagnosis called Fa.  ...  [7] applies decision-tree learning techniques to rank different system components based on their correlation with system failures.  ... 
doi:10.1109/icde.2009.115 dblp:conf/icde/DuanBM09 fatcat:jio22hgwfbhrziztmjcywt7hma

Spatio-temporal patterns in network events

Ting Wang, Mudhakar Srivatsa, Dakshi Agrawal, Ling Liu
2010 Proceedings of the 6th International Conference on - Co-NEXT '10  
Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.  ...  The first author and the last author are partially supported by grants from NSF CyberTrust program, NSF NetSE program, an IBM SUR grant, and a grant from Intel research council.  ...  Using a signature matching based approach allows us to partially match a network-event stream with a fault signature and predict the fault even before the failure event is actually observed (or received  ... 
doi:10.1145/1921168.1921172 dblp:conf/conext/WangSAL10 fatcat:mruamubuebb3hbfp3s3ocrmunq