Abstracting Multi-Core Topologies with MCTOP

Georgios Chatzopoulos, Rachid Guerraoui, Tim Harris, Vasileios Trigonakis
2017 Proceedings of the Twelfth European Conference on Computer Systems - EuroSys '17  
Portability and efficiency are usually antagonists in multi-core computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort.

[...] libnuma [47] on Linux, liblgrp [6] on Solaris, or hwloc [19]). These libraries offer a topology representation of multi-cores, as well as a companion interface for placing threads (and data). However, the provided representations are low-level and offer only the limited topology view of the operating system. As such, developers do not have access to the performance characteristics of the underlying multi-core processor. Moreover, developers still need to optimize their software for each platform. For example, they need to manually identify the hardware contexts that belong to the same cores (usually for avoiding them), calculate the best-connected sockets, and consult processor manuals for discovering the actual topology of their multi-core. The result is ad-hoc implementations, tied to the underlying platform.

We present in this paper an easier, more portable approach to optimizing software for multi-cores. We introduce MCTOP, a multi-core topology abstraction of important low-level information, such as communication latencies and memory bandwidths. MCTOP is automatically generated and exposed to software developers by our libmctop user-level library. Figure 1 depicts the visual representation of the MCTOP of an AMD Opteron. Of course, a developer could directly use this low-level information to fine-tune her software for this Opteron.
For instance, she could decide to use sockets 0 and 1, as they are connected with minimum latency. However, such optimizations, which rely on the specifics of a processor, are not portable. Instead, she could write a policy that uses any two sockets (if available) that minimize latency.

MCTOP enables the design of easy, portable, and efficient optimizations using such high-level policies. In turn, these policies make use of the actual numbers included in MCTOP. Essentially, MCTOP allows developers to express high-level semantics that utilize the low-level performance details of multi-cores, thus delivering portable optimizations. For instance, using MCTOP, we can easily define policies such as "use one hardware context per core," "use two sockets with maximum bandwidth," or even "use the maximum number of threads, in the two most remote sockets, so that each thread has access to at least 3 MB of LLC."

libmctop is based on MCTOP-ALG, our novel algorithm for inferring the topology of multi-cores. MCTOP-ALG relies on two fundamental observations: (i) cache-coherence protocols are deterministic in the absence of contention, and (ii) communication latencies characterize the topology. These observations are in accordance with the network view of multi-cores that has been proposed for OS design [12, 13, 70]. MCTOP-ALG leverages these two observations by collecting accurate core-to-core communication latencies, which are used to infer the topology of the processor. On top of this topology, libmctop collects additional low-level measurements, such as cache latencies and memory latencies/bandwidths. The end result is an automatically generated MCTOP representation of the multi-core.
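To make the two ideas above concrete, the following is a minimal Python sketch, not MCTOP-ALG and not the libmctop API. It assumes a symmetric matrix of context-to-context latencies is already available, groups contexts whose latency falls below a threshold (a simplified stand-in for the observation that latencies characterize the topology), and then applies the portable policy "use any two sockets that minimize latency". The function names, the threshold, and the latency numbers are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical latencies (in cycles) between 6 hardware contexts.
# Made-up numbers: contexts {0,1}, {2,3}, {4,5} are assumed to share a socket.
LAT = [
    [0,   40,  200, 200, 300, 300],
    [40,  0,   200, 200, 300, 300],
    [200, 200, 0,   40,  250, 250],
    [200, 200, 40,  0,   250, 250],
    [300, 300, 250, 250, 0,   40],
    [300, 300, 250, 250, 40,  0],
]


def group_by_latency(lat, threshold):
    """Greedily group contexts whose latency to a seed context is below
    `threshold`. A simplification that assumes clean, well-separated
    latency levels; it is not the paper's inference algorithm."""
    groups = []
    assigned = set()
    for i in range(len(lat)):
        if i in assigned:
            continue
        group = [j for j in range(len(lat))
                 if j not in assigned and (j == i or lat[i][j] < threshold)]
        assigned.update(group)
        groups.append(group)
    return groups


def group_latency(lat, a, b):
    """Latency between two groups: the minimum over all cross pairs."""
    return min(lat[i][j] for i in a for j in b)


def closest_socket_pair(lat, sockets):
    """Policy: return the indices of the two sockets with minimum
    latency, or None if fewer than two sockets are available."""
    if len(sockets) < 2:
        return None
    return min(combinations(range(len(sockets)), 2),
               key=lambda p: group_latency(lat, sockets[p[0]], sockets[p[1]]))


sockets = group_by_latency(LAT, threshold=100)
print(sockets)                            # [[0, 1], [2, 3], [4, 5]]
print(closest_socket_pair(LAT, sockets))  # (0, 1)
```

The policy function never names a specific socket: on a machine with different latencies the same code would select a different pair, which is the kind of portability the text argues for.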
We argue that MCTOP-ALG's measurement-based approach is superior to loading multi-core topologies from the underlying OS or hardware (e.g., using CPUID) for various reasons: (i) portability: collecting measurements is almost identical on any architecture or OS, unlike reading topology information from the OS or the hardware; (ii) forward/backward compatibility: measurements do not depend on the OS version; (iii) correctness: numbers do not lie, while the OS can be misconfigured; (iv) extensibility: independence
doi:10.1145/3064176.3064194 dblp:conf/eurosys/ChatzopoulosG0T17 fatcat:xoxdudvpx5dcxph5lnkokoe2yy