A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit <a rel="external noopener" href="https://arxiv.org/pdf/2112.02480v1.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<span class="release-stage" >pre-print</span>
The Join operator, as one of the most expensive and commonly used operators in database systems, plays a substantial role in Database Management System (DBMS) performance. Among the many different Join algorithms studied over the last decades, Hybrid Hash Join (HHJ) has proven to be one of the most efficient and widely-used join algorithms. While the performance of HHJ depends largely on accurate statistics and information about the input relations, it may not always be practical or possible<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2112.02480v1">arXiv:2112.02480v1</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/edxylscxrzgwlbxcmy3ku33s7m">fatcat:edxylscxrzgwlbxcmy3ku33s7m</a> </span>
more »... a system to have such information available. The design of HHJ depends on many details to perform well. This paper is an experimental and analytical study of the trade-offs in designing a robust and dynamic HHJ operator. We revisit the design and optimization techniques suggested by previous studies through extensive experiments, comparing them with other algorithms designed by us or used in related studies. We explore the impact of the number of partitions on the performance of HHJ and propose a lower bound and a default value for the number of partitions. We continue by designing and evaluating different partition insertion techniques to maximize memory utilization with the least CPU cost. In addition, we consider a comprehensive set of algorithms for dynamically selecting a partition to spill and compare the results against previously published studies. We then present two alternative growth policies for spilled partitions and study their effectiveness using experimental and model-based analyses. These algorithms have been implemented in the context of Apache AsterixDB and evaluated under different scenarios such as variable record sizes, different distributions of join attributes, and different storage types, including HDD, SSD, and Amazon Elastic Block Store (Amazon EBS).
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211208124239/https://arxiv.org/pdf/2112.02480v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/da/e4/dae403d265004dd37c4417ffee710301fdc2af68.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2112.02480v1" title="arxiv.org access"> <button class="ui compact blue labeled icon button serp-button"> <i class="file alternate outline icon"></i> arxiv.org </button> </a>