Automatic performance diagnosis and recovery in cloud microservices

Li Wu, Technische Universität Berlin, Odej Kao
2022
Microservices have emerged as a popular pattern for developing large-scale applications in cloud environments because of their flexibility, scalability, and agility. A microservices-based cloud system (referred to as \emph{cloud microservices}) comprises hundreds or thousands of disparate services that communicate via lightweight messaging protocols, share a finite set of hardware and software resources, and are frequently updated to meet customer requirements. In such a complex and dynamic environment, the occurrence of performance problems (e.g., slow application responses) has become the norm rather than the exception, resulting in decreased revenue, damaged reputation, and significant human effort spent on performance diagnosis and recovery. Moreover, manual operation and maintenance of cloud microservices tend to be error-prone or even impracticable. Therefore, there is an urgent need for an automatic performance problem management system that can not only detect anomalous behaviors (performance anomalies) but also uncover the root causes and recommend recovery actions. In this thesis, we investigate methods for automatic performance diagnosis and recovery in cloud microservices. The core objectives of this thesis are to identify \emph{where} and \emph{why} a performance anomaly occurs, and further to decide \emph{how} to mitigate it. To this end, this thesis contributes: (1) a method for locating the faulty service from which a performance anomaly originates, including a graphical model for capturing the propagation of the anomaly in the system; (2) two methods for identifying the anomalous metrics that cause a performance anomaly, using deep learning and spatio-temporal causal inference (CI); in addition, we evaluate the performance of CI techniques for performance diagnosis in cloud microservices through extensive experiments; (3) a method for selecting the most appropriate recovery action to mitigate an identified performance anomaly. Overall, the methods presented in this thesis diagnose root causes and rec [...]
doi:10.14279/depositonce-14959 fatcat:byvzjzcizvar5bpwqhwcs4krcu