A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit <a rel="external noopener" href="https://arxiv.org/ftp/arxiv/papers/1512/1512.08417.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<span class="release-stage" >pre-print</span>
The objective of this work was to utilize BigBench  as a Big Data benchmark and evaluate and compare two processing engines: MapReduce  and Spark . MapReduce is the established engine for processing data on Hadoop. Spark is a popular alternative engine that promises faster processing times than the established MapReduce engine. BigBench was chosen for this comparison because it is the first end-to-end analytics Big Data benchmark and it is currently under public review as TPCx-BB .<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/1512.08417v2">arXiv:1512.08417v2</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/jgjqj5ev55dt3dxdv54bpl77j4">fatcat:jgjqj5ev55dt3dxdv54bpl77j4</a> </span>
more »... ne of our goals was to evaluate the benchmark by performing various scalability tests and validate that it is able to stress test the processing engines. First, we analyzed the steps necessary to execute the available MapReduce implementation of BigBench  on Spark. Then, all the 30 BigBench queries were executed on MapReduce/Hive with different scale factors in order to see how the performance changes with the increase of the data size. Next, the group of HiveQL queries were executed on Spark SQL and compared with their respective Hive runtimes. This report gives a detailed overview on how to setup an experimental Hadoop cluster and execute BigBench on both Hive and Spark SQL. It provides the absolute times for all experiments preformed for different scale factors as well as query results which can be used to validate correct benchmark execution. Additionally, multiple issues and workarounds were encountered and solved during our work. An evaluation of the resource utilization (CPU, memory, disk and network usage) of a subset of representative BigBench queries is presented to illustrate the behavior of the different query groups on both processing engines. Last but not least it is important to mention that larger parts of this report are taken from the master thesis of Max-Georg Beer, entitled "Evaluation of BigBench on Apache Spark Compared to MapReduce" .
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20200825190306/https://arxiv.org/ftp/arxiv/papers/1512/1512.08417.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/9e/f0/9ef09e98abb60f45db5f69a65354b34be6437885.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/1512.08417v2" title="arxiv.org access"> <button class="ui compact blue labeled icon button serp-button"> <i class="file alternate outline icon"></i> arxiv.org </button> </a>