Fast crash recovery in RAMCloud

Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, Mendel Rosenblum
2011 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles - SOSP '11  
RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on disk; this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques to manage the system in a scalable and decentralized fashion. In a 60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in less time with larger clusters.

• Harnessing scale: RAMCloud takes advantage of the system's large scale to recover quickly after crashes. Each server scatters its backup data across all of the other servers, allowing thousands of disks to participate in recovery. Hundreds of recovery masters work together to avoid network and CPU bottlenecks while recovering data. RAMCloud uses both data parallelism and pipelining to speed up recovery.
• Log-structured storage: RAMCloud uses techniques similar to those from log-structured file systems [21], not just for information on disk but also for information in DRAM.
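As an illustration of the scattering idea described above, the following is a minimal sketch of randomized placement of log segments across backups. It is not RAMCloud's actual placement algorithm: the function names, the segment count, the replica count of 3, and the uniform random choice are all assumptions made for this example. Only the 60-node cluster size is taken from the abstract.

```python
import random
from collections import Counter


def scatter_segments(num_segments, backups, replicas, rng):
    """Assign each log segment to `replicas` distinct, randomly chosen backups.

    Hypothetical helper: uniform random choice stands in for whatever
    randomized technique the real system uses.
    """
    return {seg: rng.sample(backups, replicas) for seg in range(num_segments)}


def segments_per_backup(placement):
    """Count how many segment replicas each backup ends up holding."""
    return Counter(b for chosen in placement.values() for b in chosen)


# Assumed parameters for illustration only.
rng = random.Random(0)
backups = [f"backup-{i}" for i in range(60)]  # 60 nodes, as in the abstract
placement = scatter_segments(num_segments=4000, backups=backups, replicas=3, rng=rng)
load = segments_per_backup(placement)

# With random placement, every backup holds a roughly equal share, so all
# disks can be read in parallel when one server's data must be recovered.
mean_load = sum(load.values()) / len(backups)
```

Because no central coordinator decides where each segment goes, this style of placement stays decentralized as the cluster grows. For scale, the figures reported in the abstract (35 GB recovered in 1.6 seconds) correspond to an aggregate recovery rate of roughly 21.9 GB/s.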
doi:10.1145/2043556.2043560 dblp:conf/sosp/OngaroRSOR11 fatcat:iglpm5pr55eajbwylbjhpebxe4