VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters

Peng Lu, Binoy Ravindran, Changsoo Kim
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing
A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, interconnected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Checkpointing (VPC), a lightweight, globally consistent checkpointing mechanism that checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each checkpointing interval, VPC reduces the solo VM downtime further than traditional incremental checkpointing approaches. In addition, VPC uses a globally consistent checkpointing algorithm that preserves the global consistency of the VMs' execution and communication states, and it saves only the memory pages updated during each checkpointing interval to reduce the entire VC downtime. Our implementation shows that, compared with past VC checkpointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45% under the NPB benchmark and reduces the entire VC downtime by as much as 50% under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.
doi:10.1109/sbac-pad.2012.31 dblp:conf/sbac-pad/LuRK12 fatcat:4axv6vqxqre2zp2a4wlpajhyea