Performance optimisations for distributed analysis in ALICE

L Betev, A Gheata, M Gheata, C Grigoras, P Hristov
2014 Journal of Physics, Conference Series  
Performance is a critical issue in a production system accommodating hundreds of analysis users. Compared to a local session, distributed analysis is exposed to services and network latencies, remote data access and heterogeneous computing infrastructure, creating a more complex performance and efficiency optimization matrix. During the last 2 years, ALICE analysis shifted from a fast development phase to the more mature and stable code. At the same time, the frameworks and tools for
more » ... ools for deployment, monitoring and management of large productions have evolved considerably too. The ALICE Grid production system is currently used by a fair share of organized and individual user analysis, consuming up to 30% or the available resources and ranging from fully I/O-bound analysis code to CPU intensive correlations or resonances studies. While the intrinsic analysis performance is unlikely to improve by a large factor during the LHC long shutdown (LS1), the overall efficiency of the system has still to be improved by an important factor to satisfy the analysis needs. We have instrumented all analysis jobs with "sensors" collecting comprehensive monitoring information on the job running conditions and performance in order to identify bottlenecks in the data processing flow. This data are collected by the MonALISa-based ALICE Grid monitoring system and are used to steer and improve the job submission and management policy, to identify operational problems in real time and to perform automatic corrective actions. In parallel with an upgrade of our production system we are aiming for low level improvements related to data format, data management and merging of results to allow for a better performing ALICE analysis.
doi:10.1088/1742-6596/523/1/012014 fatcat:w4q7fzr6qnhxdjzbtbfwntt55m