A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
2017 Big Data Analytics  
The ever-increasing volume of data has created a need for new processing frameworks. The MapReduce model is a framework for processing and generating large-scale datasets with parallel, distributed algorithms. Apache Spark is a fast, general-purpose engine for large-scale data processing based on the MapReduce model; its main feature is in-memory computation. Recently, a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper
we perform a comparative study on the scalability of these two frameworks, using their corresponding Machine Learning libraries for batch data processing. Additionally, we analyze the performance of the two Machine Learning libraries that Spark currently provides, MLlib and ML. The same algorithms and the same dataset are used in all experiments. Experimental results show that Spark MLlib has better performance and overall lower runtimes than Flink.
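The MapReduce model the abstract refers to can be illustrated with a minimal word-count sketch in plain Python (this is a conceptual illustration of the map, shuffle, and reduce phases, not the API of Spark or Flink):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here: sum the counts).
    return {key: sum(values) for key, values in groups.items()}

docs = ["spark flink spark", "flink batch"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"spark": 2, "flink": 2, "batch": 1}
```

In Spark and Flink, the shuffle is performed by the framework across cluster nodes; Spark's in-memory caching of intermediate results between such stages is the feature the abstract highlights.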
doi:10.1186/s41044-016-0020-2