Using Message Logs and Resource Use Data for Cluster Failure Diagnosis

Edward Chuah, Arshad Jhumka, James C. Browne, Nentawe Gurumdimma, Sai Narasimhamurthy, Bill Barth
2016 IEEE 23rd International Conference on High Performance Computing (HiPC)
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data.  ...  Zhou Changjiu and Singapore Polytechnic senior management for allowing the principal author to complete this work.  ... 
doi:10.1109/hipc.2016.035 dblp:conf/hipc/ChuahJBGNB16 fatcat:cs7oj54a6jhrtlomrxpniqsh6m

Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis

Edward Chuah, Arshad Jhumka, Samantha Alt, Theo Damoulas, Nentawe Gurumdimma, Marie-Christine Sawley, William L. Barth, Tommy Minyard, James C. Browne
2017 IEEE 24th International Conference on High Performance Computing (HiPC)
How to cite: Please refer to published version for the most recent bibliographic citation information.  ...  ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data and granting access to their systems administrators.  ...  We also thank Karl Solchenbach (Intel Corporation, Europe) for granting access to his research scientists.  ... 
doi:10.1109/hipc.2017.00044 dblp:conf/hipc/ChuahJADGSBMB17 fatcat:6cuzyr5vsvcn7irb76o6vrv2nu

Failure Diagnosis for Cluster Systems using Partial Correlations

Edward Chuah, Arshad Jhumka, Samantha Alt, R. Todd Evans, Neeraj Suri
2021 Zenodo  
As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft advocated basis for failure diagnosis.  ...  The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors.  ...  IFADE makes use of the system logs [12] , [13] and resource use data [14] for its analysis.  ... 
doi:10.5281/zenodo.5509414 fatcat:7w4hzzpt4jcwtpby25jyun5cse

Priolog: Mining Important Logs via Temporal Analysis and Prioritization

Byungchul Tak, Seorin Park, Prabhakar Kudva
2019 Sustainability  
However, the growing software complexity and volume of logs make it increasingly challenging to mine useful insights from logs for problem diagnosis.  ...  We demonstrate the concepts, design, and evaluation results using actual logs.  ...  GAUL [28] is for problem diagnosis using logs in storage systems. It uses logs to detect recurring problems and solutions.  ... 
doi:10.3390/su11226306 fatcat:oejkg4o74za3jdzkq4tkqov4gm

Online Filtering of Massive Log Data in the Cloud Computing System

Zhou Li, Baojin Zhu, Xiaopeng Zheng, Liye Zhang
2014 International Journal of Database Theory and Application  
Log data is a valuable resource for failure prediction and troubleshooting in large-scale systems.  ...  losing important information required for the fault diagnosis.  ... 
doi:10.14257/ijdta.2014.7.4.22 fatcat:ydmrmtbvwjgsjkc4qc67lg3ruy

Challenges to Error Diagnosis in Hadoop Ecosystems

Jim Zhanwen Li, Siyuan He, Liming Zhu, Xiwei Xu, Min Fu, Len Bass, Anna Liu, An Binh Tran
2013 USENIX Large Installation Systems Administration Conference  
We report on some failure experiences in a real world deployment of HBase/Hadoop and propose some initial ideas for better trouble-shooting during deployment.  ...  These errors are difficult to diagnose because of scattered log management and lack of ecosystem-awareness in many diagnosis tools and processes.  ...  We experimented and demonstrated the feasibility of the approach using a small set of common Hadoop ecosystem errors.  ... 
dblp:conf/lisa/LiHZXFBLT13 fatcat:b3pqyvmicnfj3lwiwtdx3f6g7u

An Exploratory Survey of Hadoop Log Analysis Tools

Madhury Mohandas, Dhanya P M
2013 International Journal of Computer Applications  
This paper presents an exploratory assessment of the different log analyzers used for failure detection and monitoring in Hadoop. General Terms Failure Monitoring  ...  The majority of these tools congregates necessary information from each of the node in the cluster and takes it for processing. These diagnosis tools are mostly post execution analysis tools.  ...  The chief advantage with Hadoop is that it allows for the storage of data in any format. The massive use of this framework calls for the faster analysis and diagnosis of failures.  ... 
doi:10.5120/13350-0750 fatcat:rcwjkd56zfcqdamgeyvcjpfkta

Log clustering based problem identification for online service systems

Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen
2016 Proceedings of the 38th International Conference on Software Engineering Companion - ICSE '16  
When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems.  ...  Traditionally, engineers perform simple keyword search (such as "error" and "exception") of logs that may be associated with the failures. Such an approach is often time consuming and error prone.  ...  Acknowledgement We thank the intern students Can Zhang and Bowen Deng for the helpful discussions and the initial experiments.  ... 
doi:10.1145/2889160.2889232 dblp:conf/icse/LinZLZC16 fatcat:ttq5hwlfnrdw3kygce5vb4xiwu

LogM: Log Analysis for Multiple Components of Hadoop Platform

Yuxia Xie, Kai Yang, Pan Luo
2021 IEEE Access  
data, which allows us to predict system failures.  ...  We then adopt a knowledge graph approach for failure analysis and diagnosis. Extensive experiments have been carried out to assess the performance of the proposed approach.  ...  RELATED WORK As a valuable resource in system maintenance, system logs can be used for effective anomaly detection and problem diagnosis.  ... 
doi:10.1109/access.2021.3076897 fatcat:g3xen2dhejb5niyxwepmuob3r4

Automated Performance Management for the Big Data Stack

Anastasios Arvanitis, Shivnath Babu, Eric Chu, Adrian Popescu, Alkis Simitsis, Kevin Wilkinson
2019 Conference on Innovative Data Systems Research  
More than 10,000 enterprises worldwide today use the big data stack that is composed of multiple distributed systems.  ...  This sample also covers the spectrum of choices for deploying the big data stack across on-premises datacenters, private cloud deployments, public cloud deployments, and hybrid combinations of these.  ...  Next, let us look at possible ways to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root  ... 
dblp:conf/cidr/ArvanitisBCPSW19 fatcat:35mqpk66krakjnxk5ug234w63y

Computing at Massive Scale: Scalability and Dependability Challenges

Renyu Yang, Jie Xu
2016 IEEE Symposium on Service-Oriented System Engineering (SOSE)
Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications.  ...  We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including for example, incremental resource scheduling and incremental messaging communication  ...  Inc. for their work and supports.  ... 
doi:10.1109/sose.2016.73 dblp:conf/sose/YangX16 fatcat:bsbdpnfzpnf5jbl2d3hobd7adu

Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence

Alessandro Di Girolamo, Federica Legger, Panos Paparrigopoulos, Jaroslava Schovancová, Thomas Beermann, Michael Boehler, Daniele Bonacorsi, Luca Clissa, Leticia Decker de Sousa, Tommaso Diotalevi, Luca Giommi, Maria Grigorieva (+14 others)
2022 Frontiers in Big Data  
Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases.  ...  In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.  ...  There is already a variety of tools for log and error message parsing that perform clustering using methods such as frequent pattern mining, machine learning clustering, grouping by longest common subsequence  ... 
doi:10.3389/fdata.2021.753409 pmid:35072060 pmcid:PMC8776639 fatcat:evwlw3eilzhebhcvrtdbeco634

One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs [chapter]

Michal Aharon, Gilad Barash, Ira Cohen, Eli Mordechai
2009 Lecture Notes in Computer Science  
We demonstrate the usefulness of our analysis, on real world logs from various systems, for debugging of complex systems, efficient search and visualization of logs and characterization of system behavior  ...  The first is a sequential and efficient text clustering algorithm which automatically discovers the templates generating the messages.  ...  The first use case, and also the most straightforward one, is to use the transformed event logs to aid in diagnosis of system problems.  ... 
doi:10.1007/978-3-642-04180-8_32 fatcat:4bxk5sgmefcvdb2zl3anary4za

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

Xiaoyu Fu, Rui Ren, Sally A. McKee, Jianfeng Zhan, Ninghui Sun
2014 IEEE International Conference on Cluster Computing (CLUSTER)
Many methods for failure prediction are based on analyzing event logs for large scale systems, but there is still neither a widely used one to predict failures based on both non-fatal and fatal events,  ...  System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis.  ...  Logs of large-scale clusters are the primary resources for implementing dependability: they track system behaviors by accurately recording detailed data about a system's changing states.  ... 
doi:10.1109/cluster.2014.6968768 dblp:conf/cluster/FuRMZS14 fatcat:zb5nxolfo5ftncuemh3dw4t43m

Energy efficient secured cluster based distributed fault diagnosis protocol for IoT

Tabassum Ara
2022 International Journal of Communication Networks and Information Security  
This research work deals with an IoT security over WSN model to overcome the security and performance issues by designing an Energy efficient secured cluster based distributed fault diagnosis protocol (EESCFD) Model which combines the self-fault diagnosis routing model using cluster based approach and block cipher to organize a secured data communication and to identify security fault and communication  ...  $N_{ck} = C_{pk} \times \overline{hop_{mdl}}$, ftype Step 6: Upon receiving a message from group of forwarder node the destination node decrypts the data using cluster key and verifies the  ... 
doi:10.17762/ijcnis.v10i3.3586 fatcat:it7t7fa4yjfwdlyqcvvomvzgsy