Quantifying Cloud Performance and Dependability

Nikolas Herbst, Cristina L. Abad, Alexandru Iosup, André Bauer, Samuel Kounev, Giorgos Oikonomou, Erwin Van Eyk, George Kousiouris, Athanasia Evangelinou, Rouven Krebs, Tim Brecht
2018 ACM Transactions on Modeling and Performance Evaluation of Computing Systems  
In only a decade, cloud computing has emerged from the pursuit of a service-driven information and communication technology (ICT) to become a significant fraction of the ICT market. Responding to the growth of the market, many alternative cloud services and their underlying systems are currently vying for the attention of cloud users and providers. To make informed choices between competing cloud service providers, permit the cost-benefit analysis of cloud-based systems, and enable system DevOps to evaluate and tune the performance of these complex ecosystems, appropriate performance metrics, benchmarks, tools, and methodologies are necessary. This requires re-examining old system properties and considering new system properties, possibly leading to the re-design of classic benchmarking metrics such as throughput and latency (response time). In this work, we address these requirements by focusing on four system properties: (i) elasticity of the cloud service, to accommodate large variations in the amount of service requested; (ii) performance isolation between the tenants of shared cloud systems, and the resulting performance variability; (iii) availability of cloud services and systems; and (iv) the operational risk of running a production system in a cloud environment. Focusing on key metrics for each of these properties, we review the state of the art, then select or propose new metrics together with measurement approaches. We see the presented metrics as a foundation for upcoming, future industry-standard cloud benchmarks.

INTRODUCTION

Cloud computing is a paradigm under which ICT services are offered on demand "as a service," with the resources providing the service dynamically adjusted to meet the needs of a varying workload. Over the past decade, cloud computing has become increasingly important for the ICT industry. Cloud applications already represent a significant share of the entire ICT market in Europe [20] (with similar fractions expected for North America, the Middle East, and Asia). By 2025, over three-quarters of global business and personal data may reside in the cloud, according to a recent IDC report [36]. This promising growth trend makes clouds an interesting new target for benchmarking, with the goal of comparing, tuning, and improving the increasingly large set of cloud-based systems and applications, and the cloud fabric itself.
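Before turning to cloud-specific metrics, it helps to recall how the classic metrics mentioned above are computed. The sketch below derives throughput and latency statistics from a simple request trace; the trace values and the nearest-rank percentile definition are our own illustrative assumptions, not taken from the article:

```python
import math

# Illustrative request trace: (arrival_time_s, response_time_ms) pairs,
# e.g., as collected by a load generator during a benchmark run.
trace = [(0.1, 12.0), (0.4, 15.5), (0.9, 11.2), (1.3, 240.0),
         (1.8, 13.1), (2.2, 14.7), (2.9, 12.9), (3.5, 16.4)]

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]

def classic_metrics(trace):
    """Classic benchmarking metrics: throughput and latency statistics."""
    arrivals = [t for t, _ in trace]
    latencies = [l for _, l in trace]
    duration_s = max(arrivals) - min(arrivals)
    return {
        "throughput_rps": len(trace) / duration_s,
        "mean_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 0.95),
    }
```

Note how the single slow request (240 ms) barely moves the mean latency but dominates the tail percentile; effects like this are what cloud-specific metrics for variability and isolation must make explicit.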
However, traditional approaches to benchmarking may not be well suited to cloud computing environments. In classical benchmarking, system performance metrics are measured on systems-under-test (SUTs) that are well-defined, well-behaved, and often operating on a fixed or at least pre-defined set of resources. In contrast, cloud-based systems pose new and different challenges to benchmarking, because they can be built from a rich yet volatile combination of infrastructure, platforms, and entire software stacks, which recursively can be built out of cloud systems and offered as cloud services. For example, to allow its subscribers to browse the offered videos and then watch their selections on TVs, smartphones, and other devices, Netflix utilizes a cloud service that provides its front-end services, operates its own cloud services to generate different bit-rate and device-specific encodings, and leverages many other cloud services, from monitoring to payment.

A key to benchmarking the rich service and resource tapestry that characterizes many cloud services and their underlying ecosystems is the re-definition of traditional benchmarking metrics for cloud settings, and the definition of new metrics that are unique to cloud computing. This is the focus of our work, and the main contribution of this article. Academic studies, concerned public reports, and even company white papers indicate that a variety of new operational and user-centric properties of system quality (i.e., non-functional properties) are important in cloud settings. We consider four such properties. First, cloud systems are expected to deliver an illusion of infinite capacity and capability, yet appear perfectly elastic, to offer important economies of scale. Second, cloud services and systems have been shown to exhibit high performance variability [38], against which modern cloud users now expect protection (performance isolation).
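To make the notion of performance variability concrete, one common and simple indicator is the coefficient of variation (CoV) of response-time samples. The sketch below, with invented sample data, is our own illustration and not a metric proposed by the article:

```python
from statistics import mean, stdev

def cov(latencies_ms):
    """Coefficient of variation of response times: standard deviation
    relative to the mean. Unit-free; 0 means perfectly stable latency."""
    return stdev(latencies_ms) / mean(latencies_ms)

# Invented samples for two hypothetical providers under identical load:
stable_provider = [10.1, 10.3, 9.9, 10.0, 10.2]
variable_provider = [5.0, 31.0, 8.0, 26.0, 9.0]
```

The much higher CoV of the second provider signals exactly the kind of variability against which tenants of shared cloud systems expect isolation.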
Third, because the recursive nature of cloud services can lead to cascading failures across multiple clouds when even one of them fails, increasingly demanding users expect the availability of cloud services to be nearly perfect; even a few unavailability events can cause significant reputational and pecuniary damage to a cloud provider. Fourth, as the risks of not meeting implicit user expectations and explicit service contracts (service level agreements, SLAs) increase with the scale of cloud operations, cloud providers have become increasingly interested in quantifying, reducing, and possibly reporting their operational risk.

With the market growing and maturing, many cloud services now compete to retain existing customers and to attract new ones. Consequently, being able to benchmark, quantify, and compare the capabilities of competing systems is increasingly important. We first examine the research question: For the four properties of cloud services we consider, can existing metrics be applied to cloud computing environments and be used to compare services? Responding to this question, our survey of the state of the art (see Section 9) indicates that the existing body of work on (cloud) performance measurement and assessment, albeit valuable, does not satisfactorily address the question; in particular, the existing metrics leave important conceptual and practical gaps in quantifying elasticity, performance isolation and variability, availability, and operational risk for cloud services. Therefore, we propose the main research question investigated in this work: Q: Which new metrics can be useful to measure, examine, and compare cloud-based systems, for the four properties we consider in this work?
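The last two properties also lend themselves to simple first-cut measurements. The sketch below (our own illustration, not the article's proposed metrics) estimates availability as the success fraction of periodic health probes, and uses the fraction of SLA-violating requests as a crude proxy for operational risk:

```python
def availability(probe_ok):
    """Fraction of successful health probes; 0.999 is 'three nines'."""
    return sum(probe_ok) / len(probe_ok)

def sla_violation_rate(latencies_ms, threshold_ms):
    """Share of requests slower than the SLA latency threshold: a crude
    proxy for the operational risk of running the service in a cloud."""
    return sum(1 for l in latencies_ms if l > threshold_ms) / len(latencies_ms)
```

For example, one failed probe out of 1,000 yields an availability of 0.999; a provider whose SLA promises "five nines" (0.99999) would already be in breach.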
doi:10.1145/3236332