Shark

Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
2013 Proceedings of the 2013 international conference on Management of data - SIGMOD '13  
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
CREATE TABLE latest_logs TBLPROPERTIES ("shark.cache"=true) AS SELECT * FROM logs WHERE date > now()-3600;
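As a rough illustration of the column-oriented in-memory storage mentioned in the abstract, the following Python sketch pivots rows into one compact array per column and then evaluates the same kind of filter as the CREATE TABLE statement above. It is a toy model under stated assumptions, not Shark's implementation: the schema, the sample rows, and the helper names to_columns and filter_columns are all hypothetical.

```python
from array import array


def to_columns(rows, schema):
    """Pivot a list of row tuples into one typed array per column."""
    columns = {name: array(typecode) for name, typecode in schema}
    for row in rows:
        for (name, _), value in zip(schema, row):
            columns[name].append(value)
    return columns


def filter_columns(columns, predicate_column, predicate):
    """Scan only the predicate column, then gather the matching
    positions from every column into a new columnar table."""
    keep = [i for i, v in enumerate(columns[predicate_column]) if predicate(v)]
    return {
        name: array(col.typecode, (col[i] for i in keep))
        for name, col in columns.items()
    }


# Hypothetical schema: (date as Unix seconds, HTTP status, response bytes).
SCHEMA = [("date", "d"), ("status", "i"), ("bytes", "q")]

# Fixed timestamp standing in for now(), so the example is reproducible.
now = 1_000_000.0

logs = to_columns(
    [
        (now - 7200, 200, 512),   # two hours old: filtered out
        (now - 1800, 404, 128),   # thirty minutes old: kept
        (now - 60, 200, 2048),    # one minute old: kept
    ],
    SCHEMA,
)

# Rough analogue of: SELECT * FROM logs WHERE date > now()-3600
latest_logs = filter_columns(logs, "date", lambda d: d > now - 3600)
```

Storing each column as a typed array rather than a list of row objects keeps values of one type contiguous, so a predicate on date touches only that column. A real engine would add compression and partitioning across machines, which this sketch leaves out.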
doi:10.1145/2463676.2465288 dblp:conf/sigmod/XinRZFSS13 fatcat:qs4bvu7habd77g42mtm3m5sgoy