Processing large-scale data with Apache Spark
Apache Spark를 활용한 대용량 데이터의 처리

Seyoon Ko, Joong-Ho Won
<span title="2016-10-31">2016</span> <i title="The Korean Statistical Society"> <a target="_blank" rel="noopener" href="" style="color: black;">Korean Journal of Applied Statistics</a> </i>
Apache Spark is a fast, general-purpose cluster computing package. It provides a new abstraction, the resilient distributed dataset (RDD), which supports fault tolerance while keeping data in memory. This abstraction yields a significant speedup over MapReduce, the legacy large-scale data framework. In particular, the Spark framework is well suited to iterative machine learning applications such as logistic regression and K-means clustering, and interactive data
... Thanks to its versatility, Spark also supports high-level libraries for applications such as machine learning, streaming data processing, database querying, and graph data mining. In this work, we introduce the concepts and programming model of Spark and show implementations of some simple statistical computing applications. We also review the machine learning package MLlib and the R language interface SparkR.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="">doi:10.5351/kjas.2016.29.6.1077</a> <a target="_blank" rel="external noopener" href="">fatcat:ljrmw53inje5vjife5l5mmsr44</a> </span>
