Program Development Tools and Infrastructures
[report]
M Schulz
2012
unpublished
MOTIVATION

Exascale-class machines will exhibit a new level of complexity: they will feature an unprecedented number of cores and threads, will most likely be heterogeneous and deeply hierarchical, and will offer a range of new hardware techniques (such as speculative threading, transactional memory, programmable prefetching, and programmable accelerators), all of which must be exploited for an application to realize the full potential of the machine. Additionally, users will be faced with less ... per core, fixed total power budgets, and sharply reduced MTBFs. At the same time, application complexity is expected to rise sharply for exascale systems, both to implement the new science possible at exascale and to exploit the new hardware features necessary to achieve exascale performance. This is particularly true for many of the NNSA codes, which are large and often highly complex integrated simulation codes that push the limits of everything in the system, including language features. To overcome these limitations and to reach exascale performance, users will need a new generation of tools that address the bottlenecks of exascale machines, that work seamlessly with the (set of) programming models on the target machines, that scale with the machine, that provide automatic analysis capabilities, and that are flexible and modular enough to cope with the complexities and changing demands of exascale architectures. Further, any tool must be robust enough to handle the complexity of large integrated codes while keeping the user's learning curve low. Within the ASC program, in particular the CSSE (Computational Systems and Software Engineering) and CCE (Common Compute Environment) projects, we are working towards a new generation of tools that fulfill these requirements and that provide our users, as well as the larger HPC community, with the tools, techniques, and methodologies required to make exascale performance a reality.

PERFORMANCE ANALYSIS TOOLS

At exascale, performance tools will no longer be a luxury for power users, but an essential infrastructure for guiding both application and software stack developers. Performance tools for exascale systems must help developers identify shortcomings in the software stack as well as provide on-line performance feedback to guide runtime adaptation.
As such, we require a series of tool sets that not only provide a range of advanced analysis capabilities, but that are also intuitive, easy to use, and available across a range of platforms. The Open|SpeedShop (O|SS) [1] project, a joint effort between LANL, LLNL, SNL, and the Krell Institute, specifically targets ease of use and cross-platform availability. Installed on most major DOE (NNSA and ASCR) laboratory systems, it provides users with a rich set of performance analysis functionality, including PC sampling, call stack sampling, hardware counter experiments, and a wide range of tracing experiments. Users can apply O|SS using simple prefix commands and analyze the collected data in a comprehensive GUI. Loba, a tool developed at LANL, offers a different avenue: it provides a simple framework that automates the collection, evaluation, and application of mapping heuristics for MPI applications, an issue of rising importance on today's multi-core, multi-socket node systems. It features a range of mapping algorithms and uses profiling data to generate new MPI task placements for subsequent runs with similar communication characteristics. Experiments with several benchmarks have shown performance improvements of up to 16%.

DEBUGGING AND VERIFICATION TOOLS

The increased complexity and core counts of exascale systems will diminish the effectiveness of traditional interactive debuggers. To cope with the complexity of exascale executions, application developers will need additional tools that either automatically or semi-automatically reduce a problem to smaller core counts or detect the problem itself. The Stack Trace Analysis Tool (STAT) targets this need and provides a lightweight and highly scalable mechanism for identifying errors in code running at full scale [2].
It has been developed in close collaboration between LLNL, the University of Wisconsin, and the University of New Mexico, and works on the principle of detecting and grouping similar processes at suspicious points in a program's execution. STAT gathers stack traces across tasks and over time and merges the traces into a call graph prefix tree, from which it identifies task equivalence classes. The tool has proven effective even at very large scales; it has demonstrated sub-second merging latencies on 212,992 tasks [3].
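The core idea behind STAT's analysis — merging per-task stack traces into a prefix tree and grouping tasks with identical call paths — can be sketched as follows. This is a minimal illustrative sketch, not STAT's actual implementation or API; the names (Node, merge_traces, equivalence_classes) are hypothetical, and real STAT operates distributedly over a tree-based overlay network rather than on a single in-memory dictionary of traces.

```python
class Node:
    """One call-graph node: a function frame plus the tasks that reached it."""
    def __init__(self, frame):
        self.frame = frame
        self.tasks = set()     # tasks whose trace passes through this frame
        self.ended = set()     # tasks whose trace terminates at this frame
        self.children = {}     # frame name -> child Node

def merge_traces(traces):
    """Merge per-task stack traces (outermost frame first) into a prefix tree.

    traces: dict mapping task id -> list of frame names.
    """
    root = Node("<root>")
    for task, frames in traces.items():
        node = root
        node.tasks.add(task)
        for f in frames:
            node = node.children.setdefault(f, Node(f))
            node.tasks.add(task)
        node.ended.add(task)   # this task's trace ends here
    return root

def equivalence_classes(root):
    """Tasks whose full call paths are identical form one equivalence class."""
    classes, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.ended:
            classes.append(sorted(n.ended))
        stack.extend(n.children.values())
    return sorted(classes)

# Hypothetical example: tasks 0 and 1 are stuck in MPI_Wait, task 2 is computing.
traces = {0: ["main", "MPI_Wait"], 1: ["main", "MPI_Wait"], 2: ["main", "compute"]}
print(equivalence_classes(merge_traces(traces)))  # [[0, 1], [2]]
```

Grouping tasks this way is what makes the approach scale: a developer debugging a hang can attach an interactive debugger to one representative per class instead of to hundreds of thousands of individual tasks.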
doi:10.2172/1037840