A Survey on Automatic Parameter Tuning for Big Data Processing Systems

Herodotos Herodotou, Yuxing Chen, Jiaheng Lu
2020 ACM Computing Surveys  
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.

... resource management and load balancing [88]. Improper settings of configuration parameters have been shown to have detrimental effects on overall system performance and stability [39, 61, 103]. The use of automated parameter tuning techniques is a promising, yet challenging, approach for optimizing system performance. The major challenges are:

(1) Large and complex parameter space: Hadoop, Spark, and Storm have over 200 configurable parameters each [22, 70, 103]. To make matters worse, some parameters might affect the performance of different jobs in different ways, while certain groups of parameters may have dependent effects (i.e., a good setting for one parameter may depend on the setting of a different parameter) [61, 66].

(2) System scale and complexity: As data analytics platforms have grown in scale and complexity, system administrators may need to configure and tune hundreds to thousands of nodes, some equipped with different CPUs, memory, storage media, and network stacks [80]. In addition, MapReduce and Spark workloads execute iterative stages and tasks, in parallel or serially, which makes it challenging to observe and model workload performance [46].

(3) Lack of input data statistics: Data statistics are almost never available for MapReduce and Spark applications, since data often reside in semi-structured or unstructured files and are opaque until accessed [58]. As for stream applications, the input is a real-time data stream that typically experiences significant variations in workload properties [34].
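Challenge (1) can be made concrete with a small sketch. The code below is illustrative only and is not taken from the survey: `grid_size` counts the configurations in a full grid over the parameter space, and `toy_runtime` is a hypothetical cost model (its constants, the 512 MB working set, and the candidate values are all assumptions) in which the best degree of parallelism depends on the memory setting, mirroring the "dependent effects" described above.

```python
def grid_size(num_params: int, values_per_param: int) -> int:
    """Number of configurations in a full grid over the parameter space."""
    return values_per_param ** num_params


def toy_runtime(parallelism: int, memory_mb: int) -> float:
    """Hypothetical runtime model (an assumption, not from the survey).

    Total memory is shared by all parallel tasks. A task whose share drops
    below an assumed 512 MB working set spills to disk and runs 8x slower.
    """
    work = 1000.0 / parallelism            # ideal speedup from parallelism
    per_task_mb = memory_mb / parallelism  # memory available to each task
    spill_penalty = 8.0 if per_task_mb < 512 else 1.0
    return work * spill_penalty


def best_parallelism(memory_mb: int, candidates=(2, 4, 8, 16)) -> int:
    """Pick the candidate parallelism with the lowest modeled runtime."""
    return min(candidates, key=lambda p: toy_runtime(p, memory_mb))


if __name__ == "__main__":
    # Even two candidate values per parameter make exhaustive search hopeless.
    print(grid_size(200, 2))
    # The best parallelism changes when the memory setting changes.
    print(best_parallelism(2048), best_parallelism(8192))
```

In this toy model the preferred parallelism shifts from 4 to 16 when total memory grows from 2048 MB to 8192 MB, so tuning either parameter in isolation would miss the better joint setting.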
Classification of Approaches: A considerable amount of past research tackles the problem of performance optimization by partially or fully automating the process of finding near-optimal parameter values for executing jobs in big data processing systems. This survey performs a comprehensive study of existing parameter-tuning approaches, which address various challenges in achieving high throughput and resource utilization, fast response time, and cost-effectiveness. Because these approaches target different challenges and scenarios, they adopt different strategies. We classify them into the following six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning.
doi:10.1145/3381027 fatcat:7aglimtuwze25boptuano4ufdy