
Log clustering based problem identification for online service systems

Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen
2016 Proceedings of the 38th International Conference on Software Engineering Companion - ICSE '16  
When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems.  ...  Engineers only need to examine a small number of previously unseen, representative log sequences extracted from the clusters to identify a problem, thus significantly reducing the number of logs that should  ...  We thank our product team partners for their collaboration and suggestions on the applications of LogCluster.  ... 
doi:10.1145/2889160.2889232 dblp:conf/icse/LinZLZC16 fatcat:ttq5hwlfnrdw3kygce5vb4xiwu

A Survey on Automated Log Analysis for Reliability Engineering [article]

Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, Michael R. Lyu
2021 arXiv   pre-print
event templates, and how to employ logs to detect anomalies, predict failures, and facilitate diagnosis.  ...  As modern software is evolving into a large scale, the volume of logs has increased rapidly.  ...  Instead of directly clustering log messages, LogSig transformed each log message into a set of word pairs and clustered logs based on the corresponding pairs.  ... 
arXiv:2009.07237v2 fatcat:thbtfboglnglld5rr6s2gqhizi

Failure Diagnosis for Cluster Systems using Partial Correlations

Edward Chuah, Arshad Jhumka, Samantha Alt, R. Todd Evans, Neeraj Suri
2021 Zenodo  
As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis.  ...  IFADE has been put in the public domain to support system administrators in failure diagnosis.  ...  In [27], the authors presented a principled approach to obtain new insight into failure prediction using log-data on the Computer Failure Data Repository.  ... 
doi:10.5281/zenodo.5509414 fatcat:7w4hzzpt4jcwtpby25jyun5cse

Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis

Edward Chuah, Arshad Jhumka, Samantha Alt, Theo Damoulas, Nentawe Gurumdimma, Marie-Christine Sawley, William L. Barth, Tommy Minyard, James C. Browne
2017 2017 IEEE 24th International Conference on High Performance Computing (HiPC)  
ACKNOWLEDGEMENTS We would like to thank the Texas Advanced Computing Center (TACC) for providing the Ranger cluster log data and granting access to their systems administrators.  ... 
doi:10.1109/hipc.2017.00044 dblp:conf/hipc/ChuahJADGSBMB17 fatcat:6cuzyr5vsvcn7irb76o6vrv2nu

Analysis of execution log files

Meiyappan Nagappan
2010 Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - ICSE '10  
The goal of my dissertation research is to investigate log management and analysis techniques suited for very large and very complex logs, such as those we might expect in a computational cloud system.  ...  Log analysis can be used to find problems, define operational profiles, and even pro-actively prevent issues.  ...  They are i) high level users of a system who generally do not look into a log file, ii) system administrators who look into the log file to find a workaround for the failure that the high level users have  ... 
doi:10.1145/1810295.1810405 dblp:conf/icse/Nagappan10 fatcat:f5idxyu5ebe7rgl4uhjoqvj4uq

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

Xiaoyu Fu, Rui Ren, Sally A. McKee, Jianfeng Zhan, Ninghui Sun
2014 2014 IEEE International Conference on Cluster Computing (CLUSTER)  
System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis.  ...  Finally, we extract failure rules based on the observation that events of the same event types, on the same nodes or from the same applications have similar operational behaviors.  ...  Checking the log trace reveals that this FGP is also shared by other nodes in this cluster. The system log messages comprise streams of interleaved events that may have resulted from many FGPs.  ... 
doi:10.1109/cluster.2014.6968768 dblp:conf/cluster/FuRMZS14 fatcat:zb5nxolfo5ftncuemh3dw4t43m

Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures

Hiep Nguyen, Daniel Joseph Dean, Kamal Kc, Xiaohui Gu
2014 USENIX Annual Technical Conference  
Insight leverages both environment data (e.g., input logs, configuration files, states of interacting components) and runtime outputs (e.g., console logs, system calls) to guide the failure path finding  ...  We have implemented Insight and evaluated it using 13 failures from a production cloud management system and 8 open source software systems.  ...  We also thank VCL system administrators Aaron Peeler and Andy Kurth for providing us with the log data and their generous help on validation. We thank Anwesha Das for helping with the experiments.  ... 
dblp:conf/usenix/NguyenDKG14 fatcat:gl3tfduganhtpbq4mmlrzljj3u

Priolog: Mining Important Logs via Temporal Analysis and Prioritization

Byungchul Tak, Seorin Park, Prabhakar Kudva
2019 Sustainability  
However, the growing software complexity and volume of logs make it increasingly challenging to mine useful insights from logs for problem diagnosis.  ...  In this paper, we propose a novel technique, Priolog, that can narrow down the volume of logs into a small set of important and most relevant logs.  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/su11226306 fatcat:oejkg4o74za3jdzkq4tkqov4gm

SemParser: A Semantic Parser for Log Analysis [article]

Yintong Huo, Yuxin Su, Baitong Li, Michael R. Lyu
2021 arXiv   pre-print
We believe these findings provide insights into semantically understanding log messages for the log analysis community.  ...  To analyze the effectiveness of our semantic parser, we first demonstrate that it can derive rich semantics from log messages collected from seven widely-applied systems with an average F1 score of 0.987  ...  While anomaly detection identifies present faults from logs, failure diagnosis looks deeper into the problems and specifies why the failure appears.  ... 
arXiv:2112.12636v2 fatcat:pddhsmqkpfdh5mj7jlzpavsydm

Automated Performance Management for the Big Data Stack

Anastasios Arvanitis, Shivnath Babu, Eric Chu, Adrian Popescu, Alkis Simitsis, Kevin Wilkinson
2019 Conference on Innovative Data Systems Research  
We provide an overview of the requirements both at the level of individual applications as well as holistic clusters and workloads.  ...  More than 10,000 enterprises worldwide today use the big data stack that is composed of multiple distributed systems.  ...  Next, let us look at possible ways to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root  ... 
dblp:conf/cidr/ArvanitisBCPSW19 fatcat:35mqpk66krakjnxk5ug234w63y

Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments

Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
2010 2010 IEEE Network Operations and Management Symposium - NOMS 2010  
We motivate our peer-similarity observations through concrete evidence from the 4000-processor Yahoo! M45 Hadoop cluster.  ...  Kahuna incorporates techniques to statistically compare black-box (OS-level performance metrics) and white-box (Hadoop-log statistics) data across the different nodes of a MapReduce cluster, in order to  ...  DIAGNOSIS APPROACH Based on our key insights in Section IV, we assert that a node whose behavior differs from the majority of nodes in the cluster is anomalous and can be a potential source of a performance  ... 
doi:10.1109/noms.2010.5488446 dblp:conf/noms/TanPMKGN10 fatcat:obskhje7grhdflqzpv7idbiotq

One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs [chapter]

Michal Aharon, Gilad Barash, Ira Cohen, Eli Mordechai
2009 Lecture Notes in Computer Science  
We demonstrate the usefulness of our analysis, on real world logs from various systems, for debugging of complex systems, efficient search and visualization of logs and characterization of system behavior  ...  The first is a sequential and efficient text clustering algorithm which automatically discovers the templates generating the messages.  ...  Once again a multitude of logs were processed by the algorithms and, when visualized, provided some insight into the problem in the system.  ... 
doi:10.1007/978-3-642-04180-8_32 fatcat:4bxk5sgmefcvdb2zl3anary4za

Extracting the textual and temporal structure of supercomputing logs

Sourabh Jain, Inderpreet Singh, Abhishek Chandra, Zhi-Li Zhang, Greg Bronevetsky
2009 2009 International Conference on High Performance Computing (HiPC)  
In this work we propose a novel method to succinctly represent the contents of supercomputing logs, by using textual clustering to automatically find the syntactic structures of log messages.  ...  Further, we describe a methodology for using the temporal proximity between groups of log messages to identify correlated events in the system.  ...  The first insight is that our methodology extracts a significant number of correlated textual-clusters from the bgl logs.  ... 
doi:10.1109/hipc.2009.5433202 dblp:conf/hipc/JainSCZB09 fatcat:ujmsshqtnngn7cfqrzodhlc4aa

What Distributed Systems Say: A Study of Seven Spark Application Logs [article]

Sina Gholamian, Paul A. S. Ward
2021 arXiv   pre-print
Our research draws insightful findings for developers and practitioners on how to set up and utilize their distributed systems to benefit from the execution logs.  ...  We also evaluate the log effectiveness and the information gain values, and study the changes in performance and the generated logs for each benchmark with various types of distributed system failures.  ...  into the runtime state of the system.  ... 
arXiv:2108.08395v1 fatcat:caf42zizvrhkrhywr23rn32e6u

Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence

Alessandro Di Girolamo, Federica Legger, Panos Paparrigopoulos, Jaroslava Schovancová, Thomas Beermann, Michael Boehler, Daniele Bonacorsi, Luca Clissa, Leticia Decker de Sousa, Tommaso Diotalevi, Luca Giommi, Maria Grigorieva (+14 others)
2022 Frontiers in Big Data  
The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals, by allowing timely delivery of scientific results.  ...  Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases.  ...  The Jobs Buster provides insights into jobs failure causes in ATLAS, as it extracts the essential information from the jobs’ logs, and serves it in a comprehensive manner to the operators.  ... 
doi:10.3389/fdata.2021.753409 pmid:35072060 pmcid:PMC8776639 fatcat:evwlw3eilzhebhcvrtdbeco634