Using Message Logs and Resource Use Data for Cluster Failure Diagnosis
2016
2016 IEEE 23rd International Conference on High Performance Computing (HiPC)
2016) Using message logs and resource use data for cluster failure diagnosis. ...
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data. ...
Zhou Changjiu and Singapore Polytechnic senior management for allowing the principal author to complete this work. ...
doi:10.1109/hipc.2016.035
dblp:conf/hipc/ChuahJBGNB16
fatcat:cs7oj54a6jhrtlomrxpniqsh6m
Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis
2017
2017 IEEE 24th International Conference on High Performance Computing (HiPC)
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data and granting access to their systems administrators. ...
We also thank Karl Solchenbach (Intel Corporation, Europe) for granting access to his research scientists. ...
doi:10.1109/hipc.2017.00044
dblp:conf/hipc/ChuahJADGSBMB17
fatcat:6cuzyr5vsvcn7irb76o6vrv2nu
Failure Diagnosis for Cluster Systems using Partial Correlations
2021
Zenodo
As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. ...
The novel failure diagnostics workflow, called IFADE, extracts partial correlations of resource use counters and partial correlations of system errors. ...
IFADE makes use of the system logs [12], [13] and resource use data [14] for its analysis. ...
doi:10.5281/zenodo.5509414
fatcat:7w4hzzpt4jcwtpby25jyun5cse
Priolog: Mining Important Logs via Temporal Analysis and Prioritization
2019
Sustainability
However, the growing software complexity and volume of logs make it increasingly challenging to mine useful insights from logs for problem diagnosis. ...
We demonstrate the concepts, design, and evaluation results using actual logs. ...
GAUL [28] is for problem diagnosis using logs in storage systems. It uses logs to detect recurring problems and solutions. ...
doi:10.3390/su11226306
fatcat:oejkg4o74za3jdzkq4tkqov4gm
Online Filtering of Massive Log Data in the Cloud Computing System
2014
International Journal of Database Theory and Application
Log data is a valuable resource for failure prediction and troubleshooting in large-scale systems. ...
losing important information required for the fault diagnosis. ...
doi:10.14257/ijdta.2014.7.4.22
fatcat:ydmrmtbvwjgsjkc4qc67lg3ruy
Challenges to Error Diagnosis in Hadoop Ecosystems
2013
USENIX Large Installation Systems Administration Conference
We report on some failure experiences in a real world deployment of HBase/Hadoop and propose some initial ideas for better trouble-shooting during deployment. ...
These errors are difficult to diagnose because of scattered log management and lack of ecosystem-awareness in many diagnosis tools and processes. ...
We experimented and demonstrated the feasibility of the approach using a small set of common Hadoop ecosystem errors. ...
dblp:conf/lisa/LiHZXFBLT13
fatcat:b3pqyvmicnfj3lwiwtdx3f6g7u
An Exploratory Survey of Hadoop Log Analysis Tools
2013
International Journal of Computer Applications
This paper presents an exploratory assessment of the different log analyzers used for failure detection and monitoring in Hadoop. General Terms: Failure Monitoring ...
Most of these tools gather the necessary information from each node in the cluster and then process it. These diagnosis tools are mostly post-execution analysis tools. ...
The chief advantage of Hadoop is that it allows data to be stored in any format. The widespread use of this framework calls for faster analysis and diagnosis of failures. ...
doi:10.5120/13350-0750
fatcat:rcwjkd56zfcqdamgeyvcjpfkta
Log clustering based problem identification for online service systems
2016
Proceedings of the 38th International Conference on Software Engineering Companion - ICSE '16
When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems. ...
Traditionally, engineers perform simple keyword searches (such as "error" and "exception") of logs that may be associated with the failures. Such an approach is often time-consuming and error-prone. ...
Acknowledgement We thank the intern students Can Zhang and Bowen Deng for the helpful discussions and the initial experiments. ...
doi:10.1145/2889160.2889232
dblp:conf/icse/LinZLZC16
fatcat:ttq5hwlfnrdw3kygce5vb4xiwu
LogM: Log Analysis for Multiple Components of Hadoop Platform
2021
IEEE Access
data, which allows us to predict system failures. ...
We then adopt a knowledge graph approach for failure analysis and diagnosis. Extensive experiments have been carried out to assess the performance of the proposed approach. ...
RELATED WORK As a valuable resource in system maintenance, system logs can be used for effective anomaly detection and problem diagnosis. ...
doi:10.1109/access.2021.3076897
fatcat:g3xen2dhejb5niyxwepmuob3r4
Automated Performance Management for the Big Data Stack
2019
Conference on Innovative Data Systems Research
More than 10,000 enterprises worldwide today use the big data stack that is composed of multiple distributed systems. ...
This sample also covers the spectrum of choices for deploying the big data stack across on-premises datacenters, private cloud deployments, public cloud deployments, and hybrid combinations of these. ...
Next, let us look at possible ways to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root ...
dblp:conf/cidr/ArvanitisBCPSW19
fatcat:35mqpk66krakjnxk5ug234w63y
Computing at Massive Scale: Scalability and Dependability Challenges
2016
2016 IEEE Symposium on Service-Oriented System Engineering (SOSE)
Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. ...
We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including for example, incremental resource scheduling and incremental messaging communication ...
Inc. for their work and support. ...
doi:10.1109/sose.2016.73
dblp:conf/sose/YangX16
fatcat:bsbdpnfzpnf5jbl2d3hobd7adu
Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence
2022
Frontiers in Big Data
Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. ...
In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations. ...
There is already a variety of tools for log and error message parsing that perform clustering using methods such as frequent pattern mining, machine learning clustering, grouping by longest common subsequence ...
doi:10.3389/fdata.2021.753409
pmid:35072060
pmcid:PMC8776639
fatcat:evwlw3eilzhebhcvrtdbeco634
One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs
[chapter]
2009
Lecture Notes in Computer Science
We demonstrate the usefulness of our analysis, on real world logs from various systems, for debugging of complex systems, efficient search and visualization of logs and characterization of system behavior ...
The first is a sequential and efficient text clustering algorithm which automatically discovers the templates generating the messages. ...
The first use case, and also the most straightforward one, is to use the transformed event logs to aid in diagnosis of system problems. ...
doi:10.1007/978-3-642-04180-8_32
fatcat:4bxk5sgmefcvdb2zl3anary4za
Digging deeper into cluster system logs for failure prediction and root cause diagnosis
2014
2014 IEEE International Conference on Cluster Computing (CLUSTER)
Many methods for failure prediction are based on analyzing event logs for large scale systems, but there is still neither a widely used one to predict failures based on both non-fatal and fatal events, ...
System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. ...
Logs of large-scale clusters are the primary resources for implementing dependability: they track system behaviors by accurately recording detailed data about a system's changing states. ...
doi:10.1109/cluster.2014.6968768
dblp:conf/cluster/FuRMZS14
fatcat:zb5nxolfo5ftncuemh3dw4t43m
Energy efficient secured cluster based distributed fault diagnosis protocol for IoT
2022
International Journal of Communication Networks and Information Security
EESCFD) Model which combines the self-fault diagnosis routing model using cluster based approach and block cipher to organize a secured data communication and to identify security fault and communication ...
This research work deals with an IoT security over WSN model to overcome the security and performance issues by designing an Energy efficient secured cluster based distributed fault diagnosis protocol ( ...
$N_{ck} = C_{pk} \times \overline{hop_{mdl}}$, ftype. Step 6: Upon receiving a message from the group of forwarder nodes, the destination node decrypts the data using the cluster key and verifies the ...
doi:10.17762/ijcnis.v10i3.3586
fatcat:it7t7fa4yjfwdlyqcvvomvzgsy