The raincore distributed session service for networking elements

C.C. Fan, J. Bruck
Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001  
Motivated by the explosive growth of the Internet, we study efficient and fault-tolerant distributed session layer protocols for networking elements. These protocols are designed to enable a network cluster to share the state information necessary for balancing network traffic and computation load among a group of networking elements. In addition, in the presence of failures, they allow network traffic to fail-over from failed networking elements to healthy ones. To maximize the overall network
throughput of the networking cluster, we assume a unicast communication medium for these protocols. The Raincore Distributed Session Service is based on a fault-tolerant token protocol, and provides group membership, reliable multicast and mutual exclusion services in a networking environment. We show that this service provides atomic reliable multicast with consistent ordering. We also show that the Raincore token protocol consumes less overhead than a broadcast-based protocol in this environment in terms of CPU task-switching. The Raincore technology was transferred to Rainfinity, a startup company that is focusing on software for Internet reliability and performance. Rainwall, Rainfinity's first product, was developed using the Raincore Distributed Session Service. We present initial performance results of the Rainwall product that validate our design assumptions and goals.

... hope that using Raincore, application developers will easily be able to create application solutions that run on top of a cluster of networking elements. The traffic load will be shared among the nodes in the cluster, and failures will not affect the availability of the overall network service.

Figure 2. Raincore Distributed Services Architecture. [Figure labels: Applications; Raincore Distributed Data Service; Raincore Distributed Session Service; Raincore Transport Service; OSI Layer 7 Application Layer; OSI Layer 6 Presentation Layer; OSI Layer 5 Session Layer; OSI Layer 4 Transport Layer.]

As Figure 2 illustrates, the Raincore Distributed Services are mapped into Layer 4 through Layer 7 in the OSI 7-layer networking model [Tanenbaum, 1996]. While the advent of the Internet is bringing together communication and computation, part of this convergence is reflected by this architecture, which builds distributed computing protocols into the communication stack. We focus in this paper on the Raincore Distributed Session Service.
It sits on top of the Raincore Transport Service, and uses a fault-tolerant token-ring protocol to provide group communication to the upper layers. In particular, it performs group membership management and reliable multicast with consistent ordering among the group members. In addition, it offers a mutual exclusion service that is useful for the Distributed Data Service, as well as the applications.

Group communication is a key component in enabling a distributed system for the network environment, just as it is for any other distributed system. It is a necessary module to maintain the group membership of the cluster, as well as to share state information among the member nodes. Load balancing and fail-over can take place based on the cluster membership and other critical cluster information shared by the group communication module. This module can also be used to share arbitrary application state, to facilitate transparent fail-over of traffic from a failed node to a healthy node, without the clients or the servers being aware of the failures.

Group communication in a distributed system is a challenging topic that has been studied extensively, both in theory [Fischer et al., 1985] [Birman and Joseph, 1987] [Moser et al., 1994] and in practice. The key functionalities that a group communication system must provide are group membership agreement and atomic reliable multicast with consistent ordering. A number of distributed systems projects have been implemented with group communication modules at their cores. Examples include the ISIS and HORUS projects van Renesse, 1994] [van Renesse et al., 1994], the TRANSIS project [Amir et al., 1992], the TOTEM project [Amir et al., 1995], and the MPI project [Gropp et al., 1999]. A broadcast medium is assumed for these projects, and important concepts and novel algorithms have been invented and implemented for a broadcast environment to provide reliable multicast with multiple levels of consistency.
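To make the token-based approach concrete, the following is a minimal Python sketch of how a circulating token can yield consistently ordered multicast and a simplified membership view over point-to-point (unicast) hops. It is an illustration under stated assumptions, not the Raincore protocol itself: the names `Token`, `Node`, and `pass_token` are hypothetical, messages are assumed to ride on the token, failure detection is reduced to an `alive` flag, and token loss, recovery, and message garbage collection are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Hypothetical token layout: ring membership plus an ordered message log.
    members: list = field(default_factory=list)
    log: list = field(default_factory=list)  # entries are (seq, sender, payload)


@dataclass
class Node:
    name: str
    alive: bool = True
    outbox: list = field(default_factory=list)     # payloads waiting for the token
    delivered: list = field(default_factory=list)  # messages delivered so far, in order


def pass_token(nodes, token, rounds=2):
    """Move the token around the ring one unicast hop at a time."""
    for _ in range(rounds):
        for name in list(token.members):
            node = nodes[name]
            if not node.alive:
                # Simplified failure handling: drop the dead node from the ring.
                token.members.remove(name)
                continue
            # Only the token holder may append, so appends are mutually
            # exclusive and the log index acts as a global sequence number.
            while node.outbox:
                token.log.append((len(token.log), node.name, node.outbox.pop(0)))
            # Deliver every log entry this node has not yet seen, in log order.
            node.delivered.extend(token.log[len(node.delivered):])


# Example: three nodes, one of which has failed before the token reaches it.
nodes = {n: Node(n) for n in "ABC"}
token = Token(members=list(nodes))
nodes["A"].outbox.append("a1")
nodes["C"].outbox.append("c1")
nodes["B"].alive = False
pass_token(nodes, token)
```

Because every node reads the same log in the same order, the surviving nodes end up with identical delivery sequences, and the failed node no longer appears in the membership list carried by the token. A second round is needed in this sketch so that nodes visited early also see messages appended later in the first round.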
These broadcast-oriented projects have greatly helped the advancement of both scientific distributed computing and parallel database implementations.

We focus our attention on the networking elements on the Internet. In this unique environment, group communication happens at the same time as the members of the distributed system process the regular network traffic. Our design goal is therefore robust and consistent group communication among the member nodes, with minimal extra overhead on the CPU and the network. The main purpose of the group communication is to share among the networking devices the cluster state and the state information associated with the regular network traffic, so that load balancing and fail-over can occur smoothly. This will enable a cluster of networking elements to have the maximum throughput and the highest reliability. There are two implications from this unique property:
doi:10.1109/ipdps.2001.925154 dblp:conf/ipps/FanB01 fatcat:3wlt4vwhjbfsvmcete6wg2ccm4