Fast Reconstruction for Degraded Reads and Recovery Process in Primary Array Storage Systems

Baegjae SUNG, Chanik PARK
2017 IEICE Transactions on Information and Systems
Baegjae SUNG, Nonmember and Chanik PARK, Member

SUMMARY RAID has been widely deployed in disk array storage systems to manage both performance and reliability simultaneously. During disk failures, RAID conducts two performance-critical operations: degraded reads/writes and the recovery process. Before the recovery process is complete, reads and writes are degraded because data is reconstructed using data redundancy. The performance of degraded reads/writes is critical in order to meet stipulations in customer service level agreements (SLAs), and the recovery process affects the reliability of a storage system considerably. Both operations require fast data reconstruction. Among the erasure codes for fast reconstruction, Local Reconstruction Codes (LRC) are known to offer the best (or optimal) trade-off between storage overhead, fault tolerance, and the number of disks involved in reconstruction. Originally, LRC was designed for fast reconstruction in distributed cloud storage systems, in which network traffic is a major bottleneck during reconstruction. Thus, LRC focuses on reducing the number of disks involved in data reconstruction, which reduces network traffic. However, we observe that when LRC is applied to primary array storage systems, a major bottleneck in reconstruction results from uneven disk utilization. In other words, underutilized disks can no longer receive I/O requests as a result of the bottleneck of overloaded disks. Uneven disk utilization in LRC is due to its dedicated group partitioning policy, which it uses to achieve the Maximally Recoverable property. In this paper, we present Distributed Reconstruction Codes (DRC) that support fast reconstruction in primary array storage systems. DRC is designed with a group shuffling policy to solve the problem of uneven disk utilization. Experiments on real-world workloads show that DRC using global parity rotation (DRC-G) improves degraded performance by as much as 72% compared to RAID-6 and by as much as 35% compared to LRC under the same reliability. In addition, our study shows that DRC-G reduces the recovery process completion time by as much as 52% compared to LRC.

key words: array storage systems, RAID, erasure codes, fast reconstruction

1. We identified the reason for low reconstruction performance on LRC in primary array storage systems: un- [...]

[...] in 1999. He has also visited Northwestern University and Yale University in 2008 and 2015, respectively.
He has served at a number of international conferences as a program committee member. His research interests include storage systems, operating systems, and system security.
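
The uneven disk utilization described in the summary can be illustrated with a small Python sketch. It is not the DRC construction from the paper: the disk count, the group size, the function names `group_members` and `reconstruction_reads`, and the use of a seeded pseudo-random permutation as the "shuffling" are all assumptions made for illustration, and global parities are ignored. The sketch only counts how many reconstruction reads each surviving disk serves when local groups are fixed across stripes versus reshuffled per stripe.

```python
import random

def group_members(stripe, n_disks, group_size, shuffle):
    """Return the local groups (lists of disk ids) for one stripe.

    Hypothetical layout: disks are cut into consecutive groups. With
    shuffle=True the disk order is permuted per stripe using a seeded
    pseudo-random permutation (a stand-in, not the paper's policy).
    """
    order = list(range(n_disks))
    if shuffle:
        random.Random(stripe).shuffle(order)  # deterministic per stripe
    return [order[i:i + group_size] for i in range(0, n_disks, group_size)]

def reconstruction_reads(failed, n_disks, group_size, n_stripes, shuffle):
    """Count reads per surviving disk when rebuilding one block per stripe.

    A lost block is rebuilt from the other members of its local group,
    so only those disks are read for that stripe.
    """
    reads = {d: 0 for d in range(n_disks) if d != failed}
    for stripe in range(n_stripes):
        for group in group_members(stripe, n_disks, group_size, shuffle):
            if failed in group:
                for d in group:
                    if d != failed:
                        reads[d] += 1
    return reads

if __name__ == "__main__":
    fixed = reconstruction_reads(0, 8, 4, 1000, shuffle=False)
    shuffled = reconstruction_reads(0, 8, 4, 1000, shuffle=True)
    print("fixed groups:   ", fixed)
    print("shuffled groups:", shuffled)
```

With fixed groups, every read lands on the three disks that share a group with the failed disk while the other group sits idle. With per-stripe shuffling the same total number of reads is spread over all surviving disks, which is the kind of effect the summary attributes to the group shuffling policy.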
doi:10.1587/transinf.2016edp7208 fatcat:z6skj3lmtjgazemwk7rknevyxy