Performability Modeling And Analysis Of Fault Tolerance Support In Communication Protocols

Samian Kaur
2000
Much research has assessed the performance of different messaging systems, but such systems often cannot be fully characterized by performance metrics alone. For an emerging class of large-scale distributed servers, robustness is at least as important as performance. Three factors make protocol robustness critical: (i) these servers have very high availability requirements (e.g., minutes of downtime per year), implying that even occasional message loss cannot be ... phic; (ii) intra-server communication depends on external client service demands, making it extremely difficult to exert enough control over the system "by design" to avoid message loss; and (iii) many commodity LANs do not implement sufficient hardware flow control to always prevent loss inside the network under arbitrarily adverse communication patterns. Most current paradigms of reliable communication provide either strong consistency semantics with high overhead (e.g., transactional RPC) or reliability with indeterminate failure states using retransmissions (e.g., TCP/IP). This work aims to build a new messaging layer that provides additional recovery states for applications, allowing designers to reason about the cause of an error and to build customized recovery mechanisms. We present the design and implementation of a high-performance Active Message (AM) layer over the Virtual Interface Architecture (VIA) library as such a messaging infrastructure. Its performance is evaluated to ensure that the additional recovery states are achieved at a reasonable overhead. We then present a queuing model that supports the analysis and evaluation of the robustness of this messaging protocol by computing its performance as a function of dependability in the presence of component and overall failures.
doi:10.7282/t39027dx fatcat:ipmnqcgw2vae5e46up4svjqlca