Supporting long-running applications in shared compute clusters

Panagiotis Garefalakis, Peter Pietzuch, Engineering And Physical Sciences Research Council
2020
Clusters that handle data-intensive workloads at data-centre scale have become commonplace. In this setting, clusters are typically shared across several users and applications, and consolidate workloads that range from traditional analytics applications to critical services, stream processing, machine learning, and complex data processing applications. These form a growing class of applications, called long-running, which consist of containers that are used for durations ranging from hours to months. Even though long-running applications occupy a significant amount of resources in shared compute clusters today, current systems support for them is rudimentary, which not only hinders application performance and resilience but also decreases cluster resource efficiency, i.e., the effective utility extracted from cluster resources. In this thesis, we describe two main areas that lack support for long-running applications in traditional system designs. First, the way modern data processing frameworks execute complex computation tasks as part of shared long-running containers is broken. Even though these frameworks enable users to combine different types of computation as part of the same application using high-level programming interfaces, they ignore the diverse latency and throughput requirements of these computations during execution. Second, existing systems that allocate resources for long-running applications in the form of containers lack an expressive interface that can capture their placement requirements in shared compute clusters. Such placements can be expressed by means of complex constraints and are critical for both the performance and resilience of long-running applications. To address this mismatch, we introduce unified dataflows with placement constraints, an abstraction that enables the efficient execution and the effective placement of long-running applications in shared compute clusters. Our abstraction is realised as part of a novel execution framework and a new cluster manager foll [...]
doi:10.25560/83110 fatcat:2vm3vuneprejpbozvc5xh65oaq