Measuring Hadoop Optimality by Lorenz Curve
로렌츠 커브를 이용한 하둡 플랫폼의 최적화 지수 (Korean title: "Optimization Index of the Hadoop Platform Using the Lorenz Curve")

Woo-Cheol Kim, Changryong Baek
Korean Journal of Applied Statistics, 2014
Ever-increasing "big data" can only be processed effectively by parallel computing. Parallel computing refers to a high-performance computational method that divides a big query into smaller subtasks and aggregates the results of those subtasks into an output. However, it is well known that parallel computing does not automatically achieve scalability, meaning a linear improvement in performance as more computers are added, because it requires very careful assignment of tasks to each node and timely collection of the results. Hadoop is one of the most successful platforms for attaining scalability. In this paper, we propose a measure of Hadoop optimality that utilizes the Lorenz curve as a proxy for the inequality of hardware resource usage. Our proposed index takes into account the intrinsic overheads of a Hadoop system, such as CPU, disk I/O, and network, and therefore also indicates whether, and to what extent, a given Hadoop deployment can be improved. Our proposed method is illustrated with experimental data and substantiated by Monte Carlo simulations.
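To make the core idea concrete, a Lorenz curve and its associated inequality summary (the Gini coefficient) can be computed from per-node resource usage: the more unevenly work is spread across the cluster, the further the curve bows away from the line of perfect equality. The sketch below is a minimal Python illustration of that general construction, not the paper's actual optimality index; the `node_cpu` figures are hypothetical.

```python
import numpy as np

def lorenz_curve(loads):
    """Return (p, L) points of the Lorenz curve for nonnegative loads.

    p[i] is the cumulative share of nodes, and L[i] is the cumulative
    share of total load carried by the i least-loaded nodes.
    """
    x = np.sort(np.asarray(loads, dtype=float))
    cum = np.cumsum(x)
    p = np.arange(1, len(x) + 1) / len(x)
    L = cum / cum[-1]
    # Prepend the origin so the curve starts at (0, 0).
    return np.concatenate(([0.0], p)), np.concatenate(([0.0], L))

def gini(loads):
    """Gini coefficient: twice the area between the Lorenz curve and the
    45-degree line of perfect equality (0 means a perfectly balanced load)."""
    p, L = lorenz_curve(loads)
    return 1.0 - 2.0 * np.trapz(L, p)

# Hypothetical per-node CPU-seconds consumed during one MapReduce job.
node_cpu = [120.0, 118.5, 121.3, 96.0, 140.2, 119.9]
print(f"Gini inequality of CPU load: {gini(node_cpu):.3f}")
```

A perfectly balanced cluster yields a Lorenz curve along the diagonal and a Gini value of 0; skewed task assignment pushes the value toward 1, signaling headroom for rebalancing. The paper's index builds on this kind of inequality measurement while additionally accounting for Hadoop's intrinsic CPU, disk I/O, and network overheads.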
doi:10.5351/kjas.2014.27.2.249