UnifyDR: A Generic Framework for Unifying Data and Replica Placement

Ankita Atrey, Gregory Van Seghbroeck, Higinio Mora, Bruno Volckaert, Filip De Turck
2020 IEEE Access  
The advent of (big) data management applications operating at Cloud scale has led to extensive research on the data placement problem. The key objective of data placement is to obtain a partitioning (possibly allowing for replicas) of a set of data-items across distributed nodes that minimizes the overall network communication cost. Although replication is intrinsic to data placement, the two have seldom been studied in combination. Instead, most existing solutions treat them as two independent problems and employ a two-phase approach: (1) data placement, followed by (2) replica placement. We address this by proposing a new paradigm, CDR, with the objective of combining data and replica placement as a single joint optimization problem. Specifically, we study two variants of the CDR problem: (1) CDR-Single, where the objective is to minimize the communication cost alone, and (2) CDR-Multi, which performs a multi-objective optimization to also minimize traffic and storage costs. To unify data and replica placement, we propose a generic framework called UnifyDR, which leverages overlapping correlation clustering to assign a data-item to multiple nodes, thereby allowing data and replica placement to be performed jointly. We establish the generic nature of UnifyDR by demonstrating its ability to address the CDR problem in two real-world use-cases: join-intensive online analytical processing (OLAP) queries and a location-based online social network (OSN) service. The effectiveness and scalability of UnifyDR are showcased by experiments performed on data generated using the TPC-DS benchmark and a trace of the Gowalla OSN for the OLAP queries and OSN service use-cases, respectively. Empirically, the presented approach obtains an improvement of approximately 35% in terms of the evaluated metrics and a speed-up of 8 times in comparison to state-of-the-art techniques.
INDEX TERMS Data placement, replica placement, OLAP, online social networks, join-intensive queries, location-based services, scalability, overlapping clustering.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
IEEE Access, Volume 8, 2020
doi:10.1109/access.2020.3041670 fatcat:pz6i555firfobjgm2uedfwzhcu
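
The abstract's central idea, that letting a data-item live on more than one node can lower communication cost at the price of extra storage, can be illustrated with a small sketch. This is not the UnifyDR algorithm or the paper's cost model: the cost function below (one unit per data-item a query must fetch from a remote node), the node names, and the workload are all assumptions made up for illustration.

```python
def query_cost(query_items, placement, nodes):
    """Toy communication cost (an assumption, not the paper's model):
    run the query at whichever node needs the fewest remote fetches,
    and charge one unit per data-item that node does not hold."""
    return min(
        sum(1 for item in query_items if node not in placement[item])
        for node in nodes
    )


def total_cost(workload, placement, nodes):
    """Sum the toy communication cost over every query in the workload."""
    return sum(query_cost(q, placement, nodes) for q in workload)


def storage_cost(placement):
    """Number of stored copies across all nodes (replicas count separately)."""
    return sum(len(holders) for holders in placement.values())


# Hypothetical two-node cluster and a workload of three queries,
# each query being the set of data-items it touches.
nodes = ["n1", "n2"]
workload = [{"a", "b"}, {"b", "c"}, {"a", "b"}]

# A strict partitioning: every data-item lives on exactly one node.
no_replica = {"a": {"n1"}, "b": {"n1"}, "c": {"n2"}}

# An overlapping assignment: item "b" is placed on both nodes.
with_replica = {"a": {"n1"}, "b": {"n1", "n2"}, "c": {"n2"}}

cost_without = total_cost(workload, no_replica, nodes)
cost_with = total_cost(workload, with_replica, nodes)
storage_without = storage_cost(no_replica)
storage_with = storage_cost(with_replica)

print(cost_without, storage_without)
print(cost_with, storage_with)
```

In this toy setting the overlapping assignment removes the remote fetch needed by the query over `b` and `c` while storing one extra copy, which is the kind of trade-off between communication and storage costs that the CDR-Multi variant is described as optimizing.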