Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors
SIGARCH Computer Architecture News
Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache coherence and data locality in compute-server workloads. First, we consider two protocols for providing base cache coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that the centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. All of these schemes run on the same base platform and handle both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.

• CC-NUMA+MigRep: As an enhancement to CC-NUMA, the kernel can perform page-level migration and replication to increase locality. We modify the cache-miss handlers to use a small portion of the DRAM to keep per-page, per-node miss counts that the kernel uses to make migration and replication decisions.
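The counting mechanism above can be sketched roughly as follows. This is a minimal illustration in C, not the FLASH implementation: the structure layout, thresholds, and the specific migrate/replicate policy are all our assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: the cache-miss handler keeps per-page, per-node
 * miss counts in a reserved slice of local DRAM; the kernel inspects
 * them to decide whether to migrate or replicate a page.
 * NNODES and both thresholds are illustrative values, not FLASH's. */
#define NNODES 8
#define MIGRATE_THRESHOLD   64  /* misses from a single remote node */
#define REPLICATE_THRESHOLD 64  /* total misses on a read-only page */

typedef enum { DO_NOTHING, DO_MIGRATE, DO_REPLICATE } page_action;

typedef struct {
    unsigned miss[NNODES]; /* per-node miss counts for this page */
    int home;              /* node whose memory currently holds the page */
    bool writable;         /* writable pages cannot safely be replicated */
} page_stats;

/* Invoked from the (software) miss handler on each miss to this page. */
void record_miss(page_stats *p, int node) {
    p->miss[node]++;
}

/* Kernel policy sketch: replicate a read-only page that is hot overall;
 * migrate a page whose misses are dominated by one remote node. */
page_action decide(const page_stats *p) {
    unsigned total = 0, best_remote = 0;
    for (int n = 0; n < NNODES; n++) {
        total += p->miss[n];
        if (n != p->home && p->miss[n] > best_remote)
            best_remote = p->miss[n];
    }
    if (!p->writable && total >= REPLICATE_THRESHOLD)
        return DO_REPLICATE;  /* read-shared: give each node a copy */
    if (best_remote >= MIGRATE_THRESHOLD)
        return DO_MIGRATE;    /* one remote node dominates the misses */
    return DO_NOTHING;
}
```

The actual policy in the paper also has to cope with memory pressure and with pages whose access pattern changes over time; this sketch only captures the basic count-and-threshold idea.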
There are several unique aspects to this study: (i) since all of the schemes are implemented on the same hardware, and all protocols are complete, working implementations (including full operating-system modifications), this study offers the first realistic and fair comparison of these protocols; (ii) the workloads studied include all operating-system effects and kernel references (earlier COMA studies, for example, considered user-mode references only); (iii) this wide variety of schemes, especially including kernel-based migration/replication, has never before been pulled together and compared.

There are two ways to interpret the results presented in this paper. First, they can act as a guide to the desirability of implementing each individual scheme in the study. Second, the flexibility of FLASH-like machines in implementing various protocols means that a specific scheme can be chosen to achieve the best performance for a given workload.

Our comparison of the DynPtr and SCI protocols shows that the centralized directory information in DynPtr leads to a simpler design and yields superior performance. For this reason, the rest of our experiments assume DynPtr as the base CC-NUMA protocol. Our data-locality results show that the simple RAC scheme can be implemented with little additional complexity and is effective in improving performance (up to 64% faster than base CC-NUMA) by caching remote data in part of the local memory. However, the gains are quite sensitive to the size of the RAC: performance can degrade when the RAC is too small to capture the remote working set of the application, or when most of the misses are due to coherence. The COMA protocol also improves execution time (up to 14%) when the working set of the application is large and capacity misses dominate.
However, COMA is complex to implement (both in the amount of protocol code required and in the number of instructions executed by the protocol processor), and its performance can be significantly worse when coherence misses dominate. Both RAC and COMA are quite effective at increasing locality for both user and kernel references, but RAC is always superior to COMA given our base parameters and workloads. Kernel-based migration and replication requires the fewest changes to the base CC-NUMA protocol and does quite well (up to 56% faster than base CC-NUMA) when sharing is coarse-grain and pages are mostly read-only. We also found that the kernel-based and RAC schemes complement each other, and we propose a hybrid scheme called MIGRAC: kernel-based migration/replication handles coarse-grain locality decisions, while the RAC protocol exploits fine-grain locality.

The rest of this paper is organized as follows. Section 2 describes the architecture of the FLASH machine. Section 3 presents a detailed description of the various protocols and provides a qualitative analysis of their effectiveness for different cache-miss types. We describe our experimental environment and workloads in Section 4. Section 5 presents the performance results for each of our schemes. In Section 6, we explore the sensitivity of the different schemes to parameters such as memory pressure and protocol-processing speed. Section 7 proposes and evaluates the hybrid MIGRAC scheme. Finally, we discuss related work and summarize our major results.