An experimental evaluation of garbage collectors on big data applications
Proceedings of the VLDB Endowment
Popular big data frameworks, ranging from Hadoop MapReduce to Spark, rely on garbage-collected languages such as Java and Scala. Big data applications are especially sensitive to the effectiveness of garbage collection (GC), because they usually process a large volume of data objects, which leads to heavy GC overhead. The lack of an in-depth understanding of GC performance has impeded performance improvement in big data applications. In this paper, we conduct the first comprehensive evaluation of three popular garbage collectors, i.e., Parallel, CMS, and G1, using four representative Spark applications. By thoroughly investigating the correlation between these big data applications' memory usage patterns and the collectors' GC patterns, we obtain many findings about GC inefficiencies. We further propose empirical guidelines for application developers and insightful optimization strategies for designing big-data-friendly garbage collectors.

Based on the above analysis, we obtain ten findings on the GC inefficiencies for big data applications. We further propose several guidelines for application developers and GC optimization strategies for researchers. Our main findings and optimization approaches are summarized as follows.

Key findings. (1) Big data applications' unique memory usage patterns (e.g., long-lived shuffled data and humongous data objects) and computation features (e.g., iterative computation and CPU-intensive data operators) contribute to the substantial performance differences among garbage collectors. (2) Concurrent collectors such as CMS and G1 can reduce GC pause time while reclaiming long-lived shuffled data; however, they hinder CPU-intensive data operators due to serious CPU contention. (3) All three collectors are inefficient at managing humongous data objects, which lead to frequent GC cycles and even OOM errors in non-contiguous collectors like G1.

Proposed optimizations. (1) All three collectors fail to allocate proper heap space to accommodate long-lived shuffled data. To optimize object allocation, we propose a new heap resizing policy based on memory usage prediction and dynamic heap space adjustment. (2) All three collectors suffer from unnecessary continuous GC while reclaiming long-lived shuffled and cached data. By leveraging data lifecycles, we propose a new object marking algorithm to reduce GC frequency.
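The memory-usage-prediction idea behind the heap resizing policy can be illustrated with a small sketch. The class below, `ShuffleHeapPredictor`, is hypothetical and not the paper's implementation: it smooths observed per-stage shuffle sizes with an exponential moving average and suggests an old-generation size with a fixed headroom factor, under the assumption that a runtime component could act on this suggestion when dynamically adjusting heap space.

```java
/**
 * Hypothetical sketch of memory-usage prediction for heap resizing
 * (not the paper's actual algorithm). It smooths observed per-stage
 * shuffle sizes with an exponential moving average and recommends an
 * old-generation size with a safety headroom, so long-lived shuffled
 * data fits without triggering repeated full GCs.
 */
public class ShuffleHeapPredictor {
    private final double alpha;     // EMA smoothing factor in (0, 1]
    private final double headroom;  // e.g., 1.25 = 25% extra space
    private double emaBytes = -1;   // smoothed shuffle-size estimate

    public ShuffleHeapPredictor(double alpha, double headroom) {
        this.alpha = alpha;
        this.headroom = headroom;
    }

    /** Record the shuffle volume observed in the stage that just finished. */
    public void observe(long shuffleBytes) {
        emaBytes = (emaBytes < 0)
                ? shuffleBytes
                : alpha * shuffleBytes + (1 - alpha) * emaBytes;
    }

    /** Suggested old-generation size for the next stage, in bytes. */
    public long suggestedOldGenBytes() {
        return (long) Math.ceil(emaBytes * headroom);
    }
}
```

For example, after observing stages that shuffled 4, 6, and 5 GiB with `alpha = 0.5`, the smoothed estimate is 5 GiB, and a headroom of 1.25 yields a 6.25 GiB suggestion; recent stages dominate the estimate, so the suggested size tracks shifts in the workload.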
(3) All three collectors are inefficient for iterative applications that need to reclaim a large volume of shuffled data in each iteration. By leveraging the distinctive lifecycles and fixed size of these data, we propose a new object sweeping algorithm that achieves zero GC pauses for iterative applications. In addition, we identify the root causes of two OOM errors, namely the Spark framework's memory leak when handling consecutive shuffle spills and G1's heap fragmentation problem. The Spark and OpenJDK communities have confirmed our identified causes [19, 10]. In summary, our main contributions are as follows.
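The principle behind optimization (3) can be sketched with an arena-style buffer pool. The class below is illustrative only, not the paper's sweeping algorithm: it exploits the fact that shuffled data produced in one iteration has a fixed size and dies at the iteration boundary, so its buffers can be recycled wholesale instead of being traced and swept by the collector.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical sketch (not the paper's sweeping algorithm): buffers
 * for one iteration's shuffled data are fixed-size and all die at the
 * iteration boundary, so they can be recycled in bulk with no GC work.
 */
public class IterationArena {
    private final int bufferSize;
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final ArrayDeque<byte[]> inUse = new ArrayDeque<>();

    public IterationArena(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Hand out a fixed-size buffer, reusing a recycled one if possible. */
    public byte[] acquire() {
        byte[] buf = free.isEmpty() ? new byte[bufferSize] : free.pop();
        inUse.push(buf);
        return buf;
    }

    /** End of iteration: reclaim every buffer at once. */
    public void endIteration() {
        free.addAll(inUse);
        inUse.clear();
    }

    public int liveBuffers() { return inUse.size(); }
    public int recycledBuffers() { return free.size(); }
}
```

Because the buffers never become garbage, the collector never traces or sweeps them; reclamation cost is a constant-time move at each iteration boundary, which is the effect a lifecycle-aware sweeping algorithm aims for inside the collector itself.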