MODELS OF DISTRIBUTED-SHARED-MEMORY ON AN INTERCONNECTION NETWORK FOR BROADCAST COMMUNICATION

CONSTANTINE KATSINIS
Journal of Interconnection Networks (JOIN), vol. 4, no. 1, March 2003. doi:10.1142/s021926590300074x
Due to advances in fiber optics and VLSI technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. This paper examines the performance of distributed-shared-memory (DSM) systems based on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) using queuing network models and develops theoretical results which predict processor utilization, message latency, and other useful measures. It also presents simulation results which compare the performance of the SOME-Bus, the mesh, and the torus using queuing-network models. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention and can efficiently interconnect over one hundred nodes. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of the N nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or by contention for shared switching logic. The entire N-receiver array can be integrated on a single chip at a comparatively minor cost, resulting in O(N) complexity. The SOME-Bus offers much more functionality than a crossbar by supporting multiple simultaneous broadcasts of messages, allowing cache consistency protocols to complete much faster. The effect of collective communications due to cache coherence is examined. Results reveal that the performance of the SOME-Bus interconnection network is the least affected by large communication times, compared to the other two architectures considered here. Even in the presence of intense coherence traffic, processor utilization and message latency are much less affected than in the other architectures.

Introduction

High-performance computing is required for many applications, including simulation of physical phenomena, simulation of integrated circuits and neural networks, weather modeling, aerodynamics, and image processing. It relies increasingly on microprocessor-based computer nodes, groups of which are interconnected to form a distributed-memory multicomputer system. Such systems are scalable and capable of high computing power. Processes on different nodes communicate by passing messages. Programmers must use send/receive primitives and manage the distribution of data explicitly, a task which becomes more difficult as application size increases. Many parallel applications are easier to formulate and solve using the shared-memory paradigm rather than message passing. Traditional shared-memory systems offer a general and convenient programming model, but, because processors are tightly coupled, they experience increased contention and larger latency as the system size increases.

A distributed-shared-memory (DSM) system can be viewed as a set of nodes or clusters, each with local memory, communicating over an interconnection network. It hides the message-passing mechanism and provides a shared-memory model, attempting to combine ease of programming with reduced contention. DSM relies on management agents which use message passing to map the shared logical address space onto local memories and keep it coherent at all times. On each access to shared space, hardware must determine whether the requested data is in local memory; if not, the data must be copied from remote memory. Actions are also needed when data is written in shared space to preserve the coherence of shared data.
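To make this access check concrete, the following is a minimal sketch of how a DSM node might handle a shared read, assuming a directory-style organization in which each shared block has a home node. The helper functions (addr_is_local, local_mem_read, cache_lookup, home_node_of, send_read_request, wait_for_block) are hypothetical names introduced here for illustration; they do not correspond to an API defined in this paper.

    /* Illustrative DSM shared-read handler; all helpers are hypothetical. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t addr_t;

    /* Services assumed to be provided by the DSM hardware/runtime (hypothetical). */
    extern int  addr_is_local(addr_t a);                       /* block mapped to this node's memory?  */
    extern void local_mem_read(addr_t a, void *buf, size_t n); /* read from local memory               */
    extern int  cache_lookup(addr_t a, void *buf, size_t n);   /* valid cached copy of a remote block? */
    extern int  home_node_of(addr_t a);                        /* directory home of the block          */
    extern void send_read_request(int home, addr_t a);         /* request a copy from the home node    */
    extern void wait_for_block(addr_t a, void *buf, size_t n); /* stall until the copy arrives         */

    /* On each access to shared space, determine whether the requested data is
     * in local memory; if not, copy it from remote memory before continuing. */
    void dsm_read(addr_t a, void *buf, size_t n)
    {
        if (addr_is_local(a)) {            /* data already in this node's memory           */
            local_mem_read(a, buf, n);
            return;
        }
        if (cache_lookup(a, buf, n))       /* remote block, but a valid cached copy exists */
            return;

        int home = home_node_of(a);        /* directory identifies where the block lives   */
        send_read_request(home, a);        /* remote miss: message to the home node        */
        wait_for_block(a, buf, n);         /* processor stalls until the reply arrives     */
    }

    /* A write to shared space would additionally require coherence actions,
     * e.g., invalidating or updating remote copies; on the SOME-Bus such
     * invalidations can be broadcast to all nodes in a single transmission. */

The stall incurred in wait_for_block is the remote-access latency whose effect on processor utilization the queuing models in this paper quantify.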
An important objective of current research in DSM systems is the development of approaches that minimize the access time to shared data while maintaining data consistency. Their performance depends on how restrictive the memory consistency model is. Models with strong restrictions result in increased access latency and network bandwidth requirements. Sophisticated models with weaker constraints have been proposed and implemented. They allow reordering, pipelining, and overlapping of memory accesses and consequently produce better performance, but they also require that accesses to shared data be synchronized explicitly, resulting in higher programmer involvement and inconvenience. Such models are useful, but, by requiring additional effort from the programmer, they reintroduce the kind of burden that the shift away from the message-passing paradigm is trying to avoid. The success of DSM depends on its ability to free the programmer from any operations whose sole purpose is to support the memory model. This implies that the most successful consistency model may be one with larger access latency and bandwidth requirements than the possible minimum, and it is therefore critical to develop interconnection networks, connecting hundreds of nodes with high bisection bandwidth and low latency, that have the least possible adverse impact on DSM performance.

The effects of interconnection network properties and data consistency protocols have been the focus of extensive research. A DSM multiprocessor based on a two-dimensional mesh is examined in [16] using a queuing network model and simulation. For large values of the remote memory request probability, the authors find that the interconnection network saturates and processor utilization stays below 35%. Both theoretical and simulation techniques are used in [15] to study a clustered DSM multiprocessor with crossbars interconnecting processors and memories within a cluster, and processors with global memory. As the probability of a memory access per cycle increases, they find the top performance of the system to be approximately 25% of
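As a rough illustration of how queuing models of this kind relate processor utilization to network latency (a back-of-the-envelope sketch only, not the specific model developed in this paper or in [15,16]; the symbols T, p, and L are introduced here as assumptions), suppose a processor computes for a mean time T between shared-memory references, and a fraction p of those references miss locally and stall the processor for a mean round-trip network latency L. Its utilization is then approximately

\[
U \;\approx\; \frac{T}{T + p\,L},
\]

which falls as either the remote-reference rate p or the latency L grows; network saturation drives L up sharply, which is consistent with the low utilization figures reported for the mesh-based and crossbar-based systems above.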