Dynamic speculative optimizations for SQL compilation in Apache Spark

Filippo Schiavio, Daniele Bonetta, Walter Binder
2020 Proceedings of the VLDB Endowment  
Big-data systems have gained significant momentum, and Apache Spark is becoming a de facto standard for modern data analytics. Spark relies on SQL query compilation to optimize the execution performance of analytical workloads on a variety of data sources. Despite its scalable architecture, Spark's SQL code generation suffers from significant runtime overheads related to data access and de-serialization. This performance penalty can be significant, especially when applications operate on human-readable data formats such as CSV or JSON. In this paper we present a new approach to query compilation that overcomes these limitations by relying on runtime profiling and dynamic code generation. Our new SQL compiler for Spark produces highly efficient machine code, leading to speedups of up to 4.4x on the TPC-H benchmark with textual-form data formats such as CSV or JSON.
doi:10.14778/3377369.3377382 fatcat:5jm4mgvxpjhapi46kjcjjshsfq