Structural Clustering: A New Approach to Support Performance Analysis at Scale
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
The increasing complexity of high performance computing systems creates high demands on performance tools and human analysts due to an unmanageable volume of data gathered for performance analysis. A promising approach for reducing data volume is classification of data from multiple processes into groups of similar behavior to aid in analyzing application performance and identifying hot spots. However, existing approaches for structural and temporal classification of performance data suffer
... lack of scalability or produce misleading results. To address this problem, we present a novel and effective structural similarity measure to efficiently classify data from parallel processes and introduce a method for efficient storage of the classified data. Using four examples, we show how existing performance analysis techniques benefit from our structural classification. Finally, we present a case study with 15 applications on up to 65,536 parallel processes that demonstrates the generality and scalability of our classification approach.
I. INTRODUCTION
Performance analysis is a central part of the software life cycle for High Performance Computing (HPC) applications. However, performance analysis can be extremely challenging, in particular for long-running, large-scale jobs, due to the potentially very large volume of performance data collected for analysis. Performance tools vary in how much information they collect and ultimately retain, ranging from large, highly detailed event traces to compact, high-level profiles. Traces are sequential, typically time-stamped, records of execution events, and are useful for uncovering the root causes of performance problems with a temporal component. Profiles, on the other hand, typically provide a summary of execution information, such as the total amount of time spent in each function or the average number of bytes transferred in a particular communication operation, and are useful for understanding the overall picture of an execution. Tracing and profiling each have their purpose and benefits, but both also have drawbacks. In tracing, all details are retained, but that makes analysis extremely challenging due to the potentially large numbers of events for a large number of processes. Profiles only contain summary data, but because the data is reduced naively, performance differences across time or processes can be lost, making it impossible to understand performance problems.
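The trade-off between traces and profiles described above can be illustrated with a minimal sketch. The toy trace, its record layout, and the function names `per_process_profile` and `aggregate_profile` are hypothetical and chosen only for illustration; they do not correspond to any particular tool or trace format.

```python
from collections import defaultdict

# Hypothetical toy trace: (process, function, enter_time, exit_time).
# The record layout is an assumption made for this illustration.
TRACE = [
    (0, "compute", 0.0, 4.0),
    (0, "MPI_Send", 4.0, 5.0),
    (1, "compute", 0.0, 1.0),
    (1, "MPI_Recv", 1.0, 5.0),
]


def per_process_profile(trace):
    """Sum the time spent in each function separately for every process."""
    profile = defaultdict(lambda: defaultdict(float))
    for proc, func, start, end in trace:
        profile[proc][func] += end - start
    return {proc: dict(funcs) for proc, funcs in profile.items()}


def aggregate_profile(trace):
    """Naively sum the time spent in each function over all processes."""
    totals = defaultdict(float)
    for _proc, func, start, end in trace:
        totals[func] += end - start
    return dict(totals)


if __name__ == "__main__":
    print(per_process_profile(TRACE))
    print(aggregate_profile(TRACE))
```

In this toy data, the aggregated profile reports a single total for `compute`, while the per-process view shows that process 0 computes four times as long as process 1 and that process 1 spends most of its time waiting in `MPI_Recv`. This is the kind of difference across processes that a naive reduction can hide.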
In each case, the amount of data collected serves the purpose of the method, but also presents challenges for analysis. We need to find a middle ground, a way to intelligently reduce the amount of data for analysis, such that trace analysis is tractable, while at the same time selectively aggregating data in targeted profiles with less information loss.
In this paper, we present a new approach that finds such a middle ground and complements existing methods for analysis of traces and profiles, while retaining their respective strengths. To achieve this goal, our approach groups performance data, and automatically compares and categorizes performance measurements of parallel applications. In particular, we develop a lightweight system that pre-clusters structurally similar processes into a few distinct groups in preparation for more advanced analyses, such as time-based clustering, alignment-based comparison, or comparison based on profiles of different executions. By providing only a small number of groups from possibly thousands of processes, we reduce the workload imposed on successive analysis techniques and focus their usage on meaningful subsets of processes, resulting in quicker detection of performance problems. Our approach is highly scalable and considers structural differences between processes, thereby extracting the information needed to understand performance differences.
Existing approaches for comparing performance across executions either lack scalability or provide no capability to compare executions with structurally different processes within a single execution. At the same time, approaches that compare processes within a program run either do not consider program structure or lack scalability.
Our contributions in this paper include:
• A similarity measure based on execution structure,
• A method to efficiently group processes using our similarity measure,
• An extended similarity measure that discerns a special class of differences, e.g., parent-child relations between processes,
• Usage scenarios that detail how our grouping approach can aid existing analysis approaches, and
• An application study with performance measurements from a wide range of applications at up to 65,536 processes.
We found that our structural pre-clustering approach can greatly reduce the amount of data needed for application analysis. For the traces we analyzed in our application study, most could be reduced to fewer than 10 process traces, down from the original process counts, e.g., 16,384 for PEPC.
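To make the idea of structural pre-clustering concrete, the following is a minimal sketch under strong assumptions. It is not the similarity measure developed in this paper. As a stand-in, it assumes that each process is summarized by the set of call paths it executes, compares processes with plain Jaccard similarity, and assigns each process greedily to the first group whose representative is similar enough. The names `jaccard`, `group_processes`, and `STRUCTURES`, and the `threshold` parameter, are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_processes(structures, threshold=1.0):
    """Greedily group process ranks whose call-path sets are similar.

    structures: dict mapping rank -> frozenset of call paths (assumed input).
    threshold:  minimum Jaccard similarity to the group's first member.
    Returns a list of groups, each a list of ranks in ascending order.
    """
    groups = []  # list of (representative call-path set, member ranks)
    for rank in sorted(structures):
        paths = structures[rank]
        for representative, members in groups:
            if jaccard(representative, paths) >= threshold:
                members.append(rank)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append((paths, [rank]))
    return [members for _representative, members in groups]


# Hypothetical example: rank 0 additionally performs I/O.
WORKER = frozenset({"main", "main/solve"})
STRUCTURES = {
    0: frozenset({"main", "main/io", "main/solve"}),
    1: WORKER,
    2: WORKER,
    3: WORKER,
}

if __name__ == "__main__":
    print(group_processes(STRUCTURES))
    print(group_processes(STRUCTURES, threshold=0.6))
```

With the default threshold of 1.0, only structurally identical processes share a group, so the four ranks collapse into two groups and a later analysis needs one representative per group. Lowering the threshold merges processes whose structures differ only slightly, at the cost of hiding those differences.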