Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress

M. Y. Hsiao, W. C. Carter, J. W. Thomas, W. R. Stringfellow
1981 IBM Journal of Research and Development  
IBM J. Res. Develop., Vol. 25, No. 5, September 1981

Computer systems have achieved significant progress in the areas of technology, performance, capability, and RAS (reliability/availability/serviceability) during the last quarter century. In this paper, we shall review the advances of IBM computer systems in the RAS area. This progress has for the most part been evolutionary; however, in some cases it has been revolutionary. RAS developments have been driven primarily by technological advances and by increases in functional capability and complexity, but RAS considerations have also played a leading role and have improved technological and functional capability. The paper briefly reviews the progress of computer technology. It points out how IBM has maintained or improved its systems RAS capabilities in the face of the greatly increased number of components and system complexity by improved system recovery and serviceability capability, as well as by basic improvements in intrinsic component failure rate. The paper also covers the CPU, tape, and disk areas and shows how RAS improvements in these areas have been significant. The main objective is to provide a comprehensive view of significant developments in the RAS characteristics of IBM computer systems over the past twenty-five years.

Introduction and general concepts

Reliability is a measure of the consistency with which a system successfully provides its specified services. Serviceability is a measure of the ease with which the system is restored to its specified state. Availability is the percentage of the time during which the system is providing that specified service [1]. The characteristics of and the effect on the system with regard to these three interrelated quantities are referred to as the system RAS. The central issue in designing systems with good RAS characteristics is recovery: reduction of fault occurrence, detection and counteraction of errors [2], and efficient repair procedures. Recovery implies resumption of operation with data integrity. Figure 1 illustrates the basic relationship between faults and system RAS for a unified hardware/software system. In the center circle, system faults may be caused by the intrinsic device failure rate, by design faults, or by outside interference. When the fault causes an error, the first line of defense is error detection, followed by error correction (usually with error-correction codes) or by retry.
If the erroneous effect of the fault no longer exists, operation continues without repair in the reliable state. If these mechanisms do not work, the effects of the error propagate to a subsystem, and error recovery usually proceeds using an error-recovery program, with deletion of the offending subsystem. If the error cannot be contained within a subsystem, other methods of correction are needed, possibly with human intervention. However, the system still provides some service. Finally, the system may not be able to proceed at all and immediate repair is necessary. In this case, serviceability is important for efficient restoration of service.

Copyright 1981 by International Business Machines Corporation. Copying is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract may be used without further permission in computer-based and other information-service systems. Permission to republish other excerpts should be obtained from the Editor.
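
The layered defense described in the introduction (correction or retry first, then subsystem-level recovery, then degraded service, then repair) can be illustrated with a minimal Python sketch. This is an illustration only, not code from the paper: the function `handle_detected_error`, the `Outcome` names, and the three callback parameters are all hypothetical, and each callback simply reports whether that level of defense succeeded.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    # Hypothetical labels for the states described in the text.
    CORRECTED = "error corrected or retry succeeded; operation continues"
    SUBSYSTEM_REMOVED = "recovery program removed the offending subsystem"
    DEGRADED = "error not contained; system still provides some service"
    REPAIR_REQUIRED = "system cannot proceed; immediate repair needed"


def handle_detected_error(
    correct_or_retry: Callable[[], bool],
    recover_subsystem: Callable[[], bool],
    can_run_degraded: Callable[[], bool],
) -> Outcome:
    """Escalate through the lines of defense after an error is detected.

    Each callable is an assumed stand-in that returns True when that
    level of defense succeeds.
    """
    # First line of defense after detection: ECC correction or retry.
    if correct_or_retry():
        return Outcome.CORRECTED
    # Error propagated to a subsystem: run the error-recovery program.
    if recover_subsystem():
        return Outcome.SUBSYSTEM_REMOVED
    # Not contained within a subsystem: other correction, possibly manual.
    if can_run_degraded():
        return Outcome.DEGRADED
    # Nothing worked: serviceability now determines time to restoration.
    return Outcome.REPAIR_REQUIRED


result = handle_detected_error(lambda: False, lambda: True, lambda: True)
print(result.name)  # SUBSYSTEM_REMOVED
```

Ordering the checks from cheapest to most disruptive mirrors the text: later stages run only when every earlier one has failed.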
doi:10.1147/rd.255.0453 fatcat:5cysi4lynvaj5ajw2ihmhl7hv4