Anomaly symptom recognition in distributed IT systems

Alexander Acker, Technische Universität Berlin, Odej Kao
Ongoing global digitalization is driving innovative network technologies, computation platforms, and data-driven services. The number of components such as sensors, actuators, computing, storage, and network nodes, as well as the variety of service applications, is increasing and results in IT systems of high complexity. A complex system is prone to errors and failures, yet users expect services to be available at all times. Furthermore, high availability is essential for utilizing IT systems in ... cal areas such as medicine, logistics, energy, or the manufacturing industry. There, failures that are not immediately resolved can lead to hazardous situations. Consequently, system operators are increasingly overwhelmed with the task of keeping complex IT systems in an operational state, and solutions that support them in operating and maintaining such systems are required. For this purpose, artificial intelligence for IT system operations (AIOps) is being explored to improve the availability, maintainability, and reliability of IT systems. It combines the research areas of artificial intelligence, machine learning, and system operation to monitor relevant components, analyze the monitoring data, and automatically select and execute operations that maintain an efficient operational state. This automation should improve robustness against failures. This thesis introduces methods to increase the availability of IT systems by reducing the time required to resolve errors and failures. System components whose operational state deviates from a known norm are referred to as anomalies. We employ pattern recognition to search monitoring data from anomalous components for specific patterns. Identifying these anomaly symptoms allows a comparison to historical occurrences of anomalies and an automatic selection of feasible operations to resolve them. Further, our implemented methods can identify patterns that represent as yet unknown anomalies. Such cases are delegated to human experts. [...]
doi:10.14279/depositonce-14761 fatcat:jk5ttfibufewjdmo2crzmnhdii