Reliability mechanisms for very large storage systems

Qin Xin, E.L. Miller, T. Schwarz, D.D.E. Long, S.A. Brandt, W. Litwin
20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings.  
Reliability and availability are increasingly important in large-scale storage systems built from thousands of individual storage devices. Large systems must survive the failure of individual components; in systems with thousands of disks, even infrequent failures are likely in some device. We focus on two types of errors: nonrecoverable read errors and drive failures. We discuss mechanisms for detecting and recovering from such errors, introducing improved techniques for detecting errors in disk reads and fast recovery from disk failure. We show that simple RAID cannot guarantee sufficient reliability; our analysis examines the tradeoffs among other schemes between system availability and storage efficiency. Based on our data, we believe that two-way mirroring should be sufficient for most large storage systems. For those that need very high reliability, we recommend either three-way mirroring or mirroring combined with RAID.
doi:10.1109/mass.2003.1194851 dblp:conf/mss/XinMSLBL03 fatcat:cystlukppvftzmtw6qlmkwsosq