ATLAS and LHC computing on CRAY

F G Sciacca, S Haug
2017 Journal of Physics: Conference Series
Access to and exploitation of large-scale computing resources, such as those offered by general-purpose HPC centres, is one important measure for ATLAS and the other Large Hadron Collider experiments in order to meet the challenge posed by the full exploitation of the future data within the constraints of flat budgets. We report on the effort of moving the Swiss WLCG Tier-2 computing, serving ATLAS, CMS and LHCb, from a dedicated cluster to the large Cray systems at the Swiss National Supercomputing Centre CSCS. These systems not only offer very efficient hardware, cooling and highly competent operators, but also have large backfill potential due to their size and multidisciplinary usage, and offer potential gains from economies of scale. Technical solutions, performance, expected return and future plans are discussed.

[...] TOP500 list [3]. HPC systems are generally restricted and self-contained environments, subject to tight access rules. Consequently, their integration into a distributed computing environment poses challenges. Further challenges are posed by the general nature of these systems: for example, the processor architecture and/or the OS might not always be suitable. Application provisioning itself is not trivial, considering that a single ATLAS release is about 20 GB in size and the release cycles are short and unpredictable. Furthermore, integration with the experiment factories generally requires running middleware services at the site as root, as well as outbound IP connectivity, neither of which is generally allowed at HPC centres. For real data processing, the data exchange systems should be able to cope with rates of about 0.2 MB/s per core for input and about half of that for output, continuously.
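To give a feel for what these per-core rates imply at the node level, the short sketch below (our own back-of-the-envelope estimate, not part of the original study) scales them up to fully loaded nodes; the 64-core node size is an assumption taken from the Cray test system described later in this paper.

    # Back-of-the-envelope estimate of the continuous I/O bandwidth that fully
    # loaded nodes would demand from the site network and shared file systems.
    # Per-core rates are the figures quoted above; the node size is an
    # assumption matching the 64 HT-core Cray nodes described in this paper.

    INPUT_MB_S_PER_CORE = 0.2    # real-data processing, input
    OUTPUT_MB_S_PER_CORE = 0.1   # roughly half of the input rate
    CORES_PER_NODE = 64          # Broadwell node with hyper-threading enabled

    def required_bandwidth(nodes):
        """Return (input, output) bandwidth in MB/s for `nodes` busy nodes."""
        cores = nodes * CORES_PER_NODE
        return cores * INPUT_MB_S_PER_CORE, cores * OUTPUT_MB_S_PER_CORE

    if __name__ == "__main__":
        for nodes in (10, 100, 400):
            inp, out = required_bandwidth(nodes)
            print(f"{nodes:4d} nodes: ~{inp:7.1f} MB/s in, ~{out:6.1f} MB/s out")

At a few hundred fully loaded nodes this already corresponds to several GB/s of sustained input, which is one reason why the shared file systems and the network integration feature prominently in the discussion that follows.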
[...] possible cost advantages of migrating all the WLCG Tier-2 workloads from the dedicated Tier-2 Linux cluster Phoenix to the flagship Cray Piz Daint at CSCS. In addition, the integration of the Grid Storage Element within the central storage infrastructure at CSCS is being considered for the future. We have been granted access to the Cray Brisi, the Test Development System (TDS) for Piz Daint. The system is a Cray XC40 with nodes featuring Broadwell CPUs (Intel Xeon E5-2695 v4 @ 2.10 GHz) with 64 HT-cores, 128 GB RAM and a HEP-SPEC06 rating of 13.4 per core. The nodes are diskless and run the latest version of the Cray Linux Environment 6 (which is based on SUSE 12); the interconnect is Cray Aries and the LRMS is SLURM.

Since CSCS has officially endorsed the effort, some of the integration challenges have been addressed automatically, so that we could concentrate on performance and scalability. However, several adaptations to the CSCS infrastructure (mainly the network) were needed. These required a fair amount of work and constituted the first stage of the project, the feasibility study. In the second phase, which has just been concluded, we have concentrated on moving to a stage somewhere between proof of concept and pre-production. The work has focussed on integrating as many workloads as possible from the three experiments, and on addressing the application software provisioning and the fair-share complexities. Some preliminary performance studies have been performed for ATLAS. In a future phase, evaluation at scale is foreseen; this should provide detailed performance metrics to feed back into the cost study. The integration scheme is shown in Fig. 2.

Figure 2. LHConCray current shared architecture.

4.1. Technical Solutions
We have seen earlier what the main challenges of integrating a Cray into the WLCG Grid are. During the integration phase we have addressed all of them to some level. The solutions adopted are described below. Some of them will require testing at scale in order to be validated or further tuned for performance.

4.1.1. Processor architecture and OS, memory management. The processor architecture of the Piz Daint Cray is what the WLCG environment expects; the operating system, however, is not. While the latest version of the Cray Linux Environment abandons many proprietary solutions and makes the Cray a more Linux-like system, thus facilitating the integration, it cannot be expected that all types of workloads will run on it unmodified. To address this, jobs run within Shifter containers. The container itself is a full CentOS 6.8 image with the same packages as the dedicated WLCG Tier-2 cluster, configured accordingly. The main challenge of this approach has been reducing the memory overhead caused by the container itself. This is crucial, since the memory specification of the Cray nodes is quite tight at 2 GB per core with no swap. In addition, the memory requirements and management strategies of the three experiments are quite orthogonal to each other, which makes sharing the system between them particularly challenging. In SLURM, jobs are allowed to be non-node-exclusive (a departure from the typical HPC model), so that running single-core or low-core-count jobs does not leave part of the resources idle. This also means that memory limits must be carefully considered and enforced in order to ensure smooth running of the system. This introduces further complications in the fair-share settings and management, which is essentially the core of the HPC vs. HTC scheduling challenge.
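To make the above more concrete, the sketch below generates a minimal SLURM batch script that requests a share of a node with an explicit per-core memory cap and launches the payload inside a Shifter container. This is our own illustration rather than the production configuration: the --image option assumes Shifter's SLURM integration is installed at the site, and the image tag, job name and wrapper path are placeholders.

    # Minimal sketch: generate a SLURM batch script that runs an experiment
    # payload inside a Shifter container with an explicit per-core memory cap.
    # Assumptions: the site provides Shifter's SLURM integration (the --image
    # option); the image tag, wrapper path and job name are placeholders.
    import textwrap

    CORES = 8                 # low-core-count job sharing a 64-core node
    MEM_MB_PER_CORE = 2000    # ~2 GB per core; the nodes have no swap

    job_script = textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=wlcg-payload
        #SBATCH --ntasks=1
        #SBATCH --cpus-per-task={CORES}
        #SBATCH --mem-per-cpu={MEM_MB_PER_CORE}
        #SBATCH --image=docker:centos:6.8
        # Run the payload inside the Shifter container so that it sees a
        # CentOS 6 environment instead of the Cray Linux Environment.
        srun shifter /path/to/experiment_wrapper.sh
        """)

    with open("wlcg_on_cray.sbatch", "w") as fh:
        fh.write(job_script)

    # In production the submission is driven by the ARC CE back-end rather
    # than by hand, e.g.:  sbatch wlcg_on_cray.sbatch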
4.1.2. Compliance with tight access rules. As the policies have been relaxed for this specific use case, we no longer need to funnel all access through ssh under a single user. The pool accounts needed for the integration with the experiment frameworks have been granted, and we have been able to integrate the ARC CE services directly inside the Cray high-speed network.

4.1.3. Application provisioning. The integration of the middleware services goes further, allowing the root FUSE mount required on the compute nodes in order to use CVMFS as the software repository for the three experiments. However, the nodes are diskless, so the local CVMFS cache cannot simply be deployed. After trying a few different approaches, including pre-loading the entire CVMFS stratum-0 repository contents onto the Cray shared file system, the approach currently chosen is to deploy the cache as a single XFS file system per node. The file system in turn consists of a single sparse XFS file that can be located anywhere on the internal network. This drastically reduces the number of metadata operations involved and also provides the flexibility to move the caches to more performant, or even ad-hoc, file systems should the need arise. This solution still needs testing at scale and perhaps consequent optimisation.
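A minimal sketch of such a loop-mounted cache, under the assumption that the backing file simply sits on a shared file system, is shown below; the paths and the 50 GB size are illustrative and do not reflect the actual CSCS deployment.

    # Illustrative sketch (not the exact CSCS procedure) of backing a per-node
    # CVMFS cache with a single sparse XFS file, as described above.
    # Requires root; paths and the 50 GB size are placeholders.
    import os
    import subprocess

    CACHE_FILE = "/scratch/cvmfs-cache/nid00042.img"   # sparse file on a shared FS
    MOUNT_POINT = "/var/lib/cvmfs"                     # local CVMFS cache directory
    SIZE_BYTES = 50 * 1024**3                          # 50 GB logical size

    os.makedirs(os.path.dirname(CACHE_FILE), exist_ok=True)
    os.makedirs(MOUNT_POINT, exist_ok=True)

    # Create the sparse backing file: only blocks that are actually written
    # consume space on the shared file system.
    with open(CACHE_FILE, "wb") as f:
        f.truncate(SIZE_BYTES)

    # Put an XFS file system inside the file and loop-mount it where the
    # CVMFS client expects its cache.
    subprocess.run(["mkfs.xfs", "-q", "-f", CACHE_FILE], check=True)
    subprocess.run(["mount", "-o", "loop,noatime", CACHE_FILE, MOUNT_POINT],
                   check=True)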
4.1.4. Workload management integration. The ARC CE integrates by design with the ATLAS factories, and its integration with CMS and LHCb has relatively recently been established in production by the experiments. The ARC CE thus provides the main layer of integration for the Cray resources. The lack of outbound IP connectivity can be worked around by ATLAS, but not (easily) by CMS and LHCb. Consequently, the restricted outbound IP connectivity policy has been lifted, and the Cray nodes now even have public IPs with standard Linux IP packet forwarding. The internal network infrastructure has been adapted in order to ensure that all security policies are met by this setup.
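A simple way to check that the relaxed policy gives the pilot frameworks what they need is to run an outbound-connectivity probe directly on a compute node. The sketch below is our own diagnostic illustration; the endpoints and ports are examples, not the actual experiment services.

    # Simple outbound-connectivity probe of the kind one might run on a compute
    # node to confirm that the CMS/LHCb pilot tools can reach external services.
    # The endpoints listed here are examples only.
    import socket

    ENDPOINTS = [
        ("cvmfs-stratum-one.cern.ch", 80),   # example: a CVMFS stratum-1 server
        ("voms2.cern.ch", 15002),            # example: a VOMS server port
    ]

    def can_connect(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in ENDPOINTS:
            status = "ok" if can_connect(host, port) else "FAILED"
            print(f"{host}:{port} -> {status}")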
4.1.5. Data input, processing and retrieval. For ATLAS we leverage the ARC CE technology, which provides its own Data Delivery services; these have been demonstrated to be sufficiently performant at scale. Since the ARC CE is integrated in the Cray high-speed network and full IP connectivity is ensured, both functionality and performance at scale should be assured. CMS and LHCb make use of the tools built into their pilot frameworks, thus also benefiting from the full IP connectivity of the nodes. It should be noted that with diskless nodes, the shared file systems serving the Cray become central to the performance of the system. These are normally [...]

doi:10.1088/1742-6596/898/8/082004