ParaStation: Efficient parallel computing by clustering workstations: Design and evaluation

Thomas M. Warschko, Joachim M. Blum, Walter F. Tichy
1998, Journal of Systems Architecture
ParaStation is a communications fabric for connecting off-the-shelf workstations into a supercomputer. The fabric employs technology used in massively parallel machines and scales up to 4096 nodes. The message-passing software preserves the low latency of the fabric by taking the operating system out of the communication path, while still providing full protection. The first implementation of ParaStation using Digital's AlphaGeneration workstations achieves end-to-end process-to-process latencies as low as 2.5 µs and a sustained bandwidth of more than 10 MByte/s per channel with small packets. Benchmarks using PVM on ParaStation demonstrate real application performance of 1 GFLOP on an 8-node cluster.

Introduction

Networks of workstations and PCs offer a cost-effective and scalable alternative to monolithic supercomputers. Thus, bundling a cluster of workstations (either single-processors or small multi-processors) into a parallel system would seem to be a straightforward solution for computational tasks that are too large for a single machine. However, conventional communication mechanisms and protocols yield communication latencies that make only very large-grain parallelism efficient. For example, typical parallel programming environments such as PVM [BDG+93], P4 [BL92], and MPI [CGH94] have latencies of several milliseconds. As a consequence, the parallel grain size necessary to achieve acceptable efficiency has to be in the range of tens of thousands of arithmetic operations.

In contrast, massively parallel systems (MPPs) offer an excellent communication/computation ratio. But engineering lag time causes a widening gap to the rapidly increasing performance of state-of-the-art microprocessors, and low-volume manufacturing results in a cost/performance disadvantage. This situation is not unique to MPP systems; it applies to multiprocessor servers as well [ACP95].

ParaStation's approach is to combine the benefits of a high-speed MPP network with the excellent price/performance ratio and the standardized programming interfaces of conventional workstations. Well-known programming interfaces ensure portability over a wide range of different systems. The integration of a high-speed MPP network opens up the opportunity to eliminate as much communication overhead as possible. The retargeted MPP network of ParaStation was originally developed for the Triton/1 system [HWTP93] and operates in a 256-node system. The key features of the network design are autonomous distributed switching, hardware flow control at link level, and optimized protocols for point-to-point message passing. In a ParaStation system, this network is connected to the host systems via PCI-bus interface boards. The software design focuses on standardized programming interfaces (UNIX sockets) while preserving the low latency and high throughput of the MPP network. ParaStation implements operating system functionality in user space to minimize overhead, while providing the protection of a true multiuser, multiprogramming environment. The current design is capable of performing basic communication operations with a total process-to-process latency of just a few microseconds (i.e., 2.5 µs for a 32-bit packet). Compared to workstation clusters using standard communication hardware (e.g., message-passing software such as PVM over Ethernet or FDDI), our system shows performance improvements of more than two orders of magnitude on communication benchmarks. As a result, application benchmarks (e.g., the ScaLAPACK equation solver and others) execute with nearly linear speedup over a wide range of problem sizes.
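Latency figures of this kind are typically obtained with a simple round-trip (ping-pong) microbenchmark between two processes. The sketch below shows what such a benchmark looks like over the standard UNIX socket interface mentioned above; it is an illustration only, not ParaStation code, and the port number, payload size, and iteration count are arbitrary assumptions. Timing many round trips and halving the average gives the end-to-end, process-to-process latency; repeating the loop with larger payloads yields the sustained bandwidth.

/* pingpong.c -- minimal round-trip (ping-pong) latency microbenchmark.
 * Illustrative sketch only: plain TCP sockets, not ParaStation code;
 * port, payload size, and iteration count are arbitrary assumptions.
 * Usage: "./pingpong server" on one node, "./pingpong client <host>"
 * on another; the client prints half the average round-trip time. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <netdb.h>

enum { PORT = 5678, SIZE = 4, ITERS = 10000 };  /* 4-byte (32-bit) payload */

static void readn(int fd, char *b, int n)       /* read exactly n bytes   */
{ for (int k = 0; k < n; ) { int r = read(fd, b + k, n - k); if (r <= 0) exit(1); k += r; } }

static void writen(int fd, char *b, int n)      /* write exactly n bytes  */
{ for (int k = 0; k < n; ) { int r = write(fd, b + k, n - k); if (r <= 0) exit(1); k += r; } }

int main(int argc, char **argv)
{
    char buf[SIZE] = { 0 };
    int one = 1, fd;
    struct sockaddr_in a = { 0 };
    a.sin_family = AF_INET;
    a.sin_port = htons(PORT);

    if (argc > 1 && !strcmp(argv[1], "server")) {          /* echo side   */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        a.sin_addr.s_addr = INADDR_ANY;
        if (bind(lfd, (struct sockaddr *)&a, sizeof a) < 0) { perror("bind"); return 1; }
        listen(lfd, 1);
        fd = accept(lfd, NULL, NULL);
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        for (int i = 0; i < ITERS; i++) { readn(fd, buf, SIZE); writen(fd, buf, SIZE); }
    } else if (argc > 2 && !strcmp(argv[1], "client")) {   /* timing side */
        struct hostent *h = gethostbyname(argv[2]);
        struct timeval t0, t1;
        if (!h) { fprintf(stderr, "unknown host\n"); return 1; }
        memcpy(&a.sin_addr, h->h_addr_list[0], h->h_length);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&a, sizeof a) < 0) { perror("connect"); return 1; }
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);  /* avoid Nagle delays */
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) { writen(fd, buf, SIZE); readn(fd, buf, SIZE); }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("one-way latency estimate: %.2f us over %d round trips\n",
               us / (2.0 * ITERS), ITERS);
    } else {
        fprintf(stderr, "usage: %s server | client <host>\n", argv[0]);
        return 1;
    }
    return 0;
}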
Related Work

There are several projects targeting low-latency, high-throughput parallel computing on workstation clusters.

MINI (Memory-Integrated Network Interface) [MBH95] targets a 1-Gbps-bandwidth, 1.2-µs-latency interconnect using an ATM network. Communication in MINI is based on channels between participating processes, using ATM's virtual channel concept. Performance figures (an ATM cell round-trip time of 3.9 µs at 10 MByte/s) are based on VHDL simulations; hardware development is in progress.

SHRIMP (Scalable High-Performance Really Inexpensive Multiprocessor) [BDF+95] supports virtual-memory-mapped communication, allowing user processes to communicate without expensive buffer management and without system calls across the protection boundary separating user processes from the operating system kernel. Using Pentium PCs as the platform, the network interface is connected to the EISA bus (SHRIMP-I) or the Xpress memory bus (SHRIMP-II). A 16-node SHRIMP-I and a 2-node SHRIMP-II prototype were expected to be operational in 2Q 95.

Myrinet [BCF+95] is a new type of local area network based on technology used for packet communication and switching within massively parallel processors. Measured performance using Myrinet API functions achieves one-way, end-to-end rates of 250 Mbps on 8-KByte packets.

Illinois Fast Messages [PLC95] is a high-speed messaging layer that delivers low latency and high bandwidth for short messages. On Myrinet-connected SPARCstations, one-way latencies of 25 µs were measured for small packets, and for large packets a bandwidth of up to 19.6 MByte/s was achieved.

Von Eicken et al. adapted Active Messages [vECGS92, vEBB95] to a Sun workstation cluster interconnected by an ATM network. The prototype implementation shows a peak bandwidth of 7.5 MByte/s and a round-trip latency of 52 µs.

The Berkeley NOW (Network of Workstations) project [ACP95] targets 100+ workstation clusters built from off-the-shelf components. One initial prototype is a cluster of HP9000/735s using an experimental Medusa FDDI interface. The final demonstration system will use either a second-generation ATM LAN or a retargeted MPP network, such as Myrinet.

Digital's Memory Channel [Ros95] is a low-latency cluster interconnect that provides a shared memory space among interconnected systems. It achieves a hardware latency of 5 µs and an aggregate bandwidth of 100 MByte/s on an 8-port hub.

Sun's S-Connect [NBKP95] is a high-speed, scalable interconnect system developed to let networks of workstations share computing resources. In the S3.mp distributed shared-memory multiprocessor [NAB+94], S-Connect switching fabrics deliver more than 100 MByte/s of user-program-accessible bandwidth at latencies of about 1 µs.

ATM as a fast workstation interconnect promised high-bandwidth links as well as low network latency. With vendor-supplied device drivers, however, end-to-end latency for small packets is worse over ATM than over Ethernet [KAP95, BBVvE95].

In contrast to most other approaches, we focus on a pure message-passing environment rather than virtual shared memory. As von Eicken et al. pointed out [vEBB95], recent workstation operating systems do not support a uniform address space, so virtual shared memory is difficult to maintain. As with Active Messages and Fast Messages, our performance improvement is based on user-level access to the network, but in contrast to them, we provide multiuser, multiprogramming capabilities. Like Myrinet and S-Connect, our network was originally designed for an MPP system (Triton/1) and is now retargeted to a workstation cluster environment. Myrinet, the IBM SP2, and Digital's Memory Channel use central switching fabrics, while ParaStation provides distributed switches on each interface board.
Performance Hurdles in Workstation Clusters

Using existing workstation clusters as a virtual supercomputer suffers from several problems related to standard communication hardware, traditional approaches in the operating system, and the design of widely used programming environments. Standard communication hardware (e.g., Ethernet, FDDI, ATM) was developed for a LAN/WAN environment rather than for MPP communication. Network links are considered unreliable, so higher protocol layers must detect packet loss or corruption and provide retransmission. Common network topologies, such as a bus or a ring, do not scale very well. Sharing one physical medium among the connected workstations results in se-
doi:10.1016/s1383-7621(97)00039-8