Autonomic self-healing in cloud computing platforms

Anton Gulenko, Technische Universität Berlin, Odej Kao
The demand for increasingly rich services with high-level abstractions drives the field of cloud computing. In both research and industry, additional layers and components continuously increase the complexity of these modern platforms. In truth, the size of cloud systems has long surpassed what human administrators are able to manage. Nonetheless, users and customers expect high availability and reliability from both the applications and the underlying platform, which is only possible through automation. Today, most automated dependability techniques focus on increasing the availability of distributed systems by preventing or masking component outages. However, both software and hardware components often exhibit a behavior where the delivered service degrades without becoming entirely unavailable. Such anomalies, also called gray failures or degraded states, originate from software bugs or other unforeseen issues with the system. Some application-specific systems attempt to handle certain types of anomalies by applying a pre-defined set of rules. In general, however, administrators have to resolve anomaly situations manually. In practice, there is no technique or system for detecting and resolving anomaly situations in a generic way. Accordingly, this thesis suggests an extension to traditional cloud infrastructures by providing self-healing functionalities. Our approach monitors live data streams collected from all critical system components and analyzes the collected data for anomalous behavior. Once an anomaly is detected, the system further investigates the situation, determines the root cause, and automatically implements a remediation plan to resolve the problem. We analyze the requirements to build such a self-healing system and present an abstract system architecture that fulfills the given requirements. The proposed self-healing cloud provides administrators with a coherent set of configuration values that determine the level to which remediation workflows are executed automatically. Further, we design [...]
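The monitor, analyze, and remediate loop described in the abstract can be illustrated with a minimal sketch. This is not the architecture from the thesis: the rolling z-score detector, the two automation levels, and all names (`RollingZScoreDetector`, `handle`, `NOTIFY_ONLY`, `AUTO_REMEDIATE`) are assumptions chosen for illustration, and root cause analysis is left out entirely.

```python
from collections import deque
from statistics import mean, stdev

# Hypothetical automation levels. The abstract only states that configuration
# values determine how far remediation workflows run automatically.
NOTIFY_ONLY, AUTO_REMEDIATE = 0, 1


class RollingZScoreDetector:
    """Flags a sample that deviates strongly from a window of recent samples.

    A stand-in for the anomaly analysis step, not the method of the thesis.
    """

    def __init__(self, window=20, threshold=3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        anomalous = False
        if len(self.samples) >= 5:  # wait for a small baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            # A constant baseline (sigma == 0) never flags anything here.
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        if not anomalous:
            # Only normal samples update the baseline, so an ongoing
            # degradation does not become the new "normal".
            self.samples.append(value)
        return anomalous


def handle(component, value, detector, automation_level, remediate):
    """Process one metric sample for one component and report the outcome."""
    if not detector.observe(value):
        return "ok"
    if automation_level >= AUTO_REMEDIATE:
        remediate(component)  # e.g. restart or migrate; supplied by the caller
        return "remediated"
    return "escalated"  # left to a human administrator
```

A caller would feed each new metric sample (for example, a response time) to `handle` and pass a `remediate` callback that performs the actual recovery action. Keeping the automation level as a plain parameter mirrors the idea that administrators decide how much of the workflow runs without human approval.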
doi:10.14279/depositonce-10340 fatcat:bsuuybgoijfkxmnfcm7nx5ha6a