A convergence of key-value storage systems from clouds to supercomputers

Tonglin Li, Xiaobing Zhou, Ke Wang, Dongfang Zhao, Iman Sadooghi, Zhao Zhang, Ioan Raicu
2015, Concurrency and Computation: Practice and Experience
This paper presents a convergence of distributed key-value storage systems in clouds and supercomputers. It specifically presents ZHT, a zero-hop distributed key-value store, which has been tuned for the requirements of high-end computing systems. ZHT aims to be a building block for future distributed systems, such as parallel and distributed file systems, distributed job management systems, and parallel programming systems. ZHT has several important properties, such as being light-weight,
dynamically allowing nodes to join and leave, being fault tolerant through replication, being persistent and scalable, and supporting unconventional operations such as append, compare-and-swap, and callback in addition to the traditional insert/lookup/remove. We have evaluated ZHT's performance on a variety of systems, ranging from a 64-node Linux cluster and an Amazon EC2 virtual cluster of up to 96 nodes to an IBM Blue Gene/P supercomputer with 8K nodes. We compared ZHT against other key-value stores and found that it offers superior performance for the features and portability it supports. This paper also presents several real systems that have adopted ZHT, namely FusionFS (a distributed file system), IStore (a storage system with erasure coding), MATRIX (distributed scheduling), Slurm++ (distributed HPC job launch), and Fabriq (distributed message queue management); all of these systems have been simplified by building on key-value storage, and have been shown to outperform other leading systems, by orders of magnitude in some cases. It is important to highlight that some of these systems are rooted in HPC and supercomputers, while others are rooted in clouds and ad hoc distributed systems; through our work, we have shown how versatile key-value storage systems can be across such a variety of environments.

Many scientific domains (e.g. astronomy, bioinformatics [2], and financial analysis) share similar data management challenges, strengthening the potential impact of generic solutions. "A supercomputer is a device for turning compute-bound problems into I/O-bound problems" [3]. This quote from Ken Batcher captures the essence of modern high-performance computing and implies an ever-growing shift of the bottleneck from compute to I/O. For exascale computers, the challenges are even more radical, as the only viable approaches to achieving exascale computing in the next decade all involve extremely high parallelism and concurrency [4]. As of 2015, some of the biggest systems already have more than 3 million general-purpose cores. Many experts predict that exascale computing will be a reality by the end of the decade; an exascale system is expected to have millions of nodes, billions of threads of execution, hundreds of petabytes of memory, and an exabyte of persistent storage.

In the current decades-old architecture of HPC systems, storage (e.g. parallel file systems such as GPFS [5], PVFS [6], and Lustre [7]) is completely separated from compute resources, connected to them only by a high-speed network. This approach cannot scale by several orders of magnitude in concurrency and throughput, and will thus prevent the move from petascale to exascale. The unscalable storage architecture could be a "showstopper" in building exascale systems [4]. Although work such as burst buffers [8, 9] alleviates the parallel file system bottleneck, in the long run the need to build efficient and scalable distributed storage for high-performance computing (HPC) systems that scales by three to four orders of magnitude is on the horizon.

One of the major bottlenecks in current state-of-the-art storage systems is metadata management. Metadata operations on most parallel and distributed file systems can be inefficient at large scale. Our previous work (Fig. 1) on a Blue Gene/P supercomputer with 16K cores shows the cost of file/directory creation (a file system metadata operation) on GPFS.
GPFS's metadata performance degrades rapidly under concurrent operations, reaching saturation at only 4 to 32 cores (on a 160K-core machine). Ideal performance would be constant across scales, but the cost of these basic metadata operations (e.g. file create) grows exponentially, from tens of milliseconds on a single node (four cores) to tens of seconds at 16K-core scale; at the full machine scale of 160K cores, we expect one file creation to take over two minutes in the many-directory case, and over ten minutes in the single-directory case. Previous work shows these times to be even worse, putting full-system-scale metadata operations in the hours range, although GPFS may have improved over the last several years. On a large-scale HPC system, whether the time per metadata operation is minutes or hours, the conclusion is the same: metadata management in GPFS is not sufficiently distributed, and not enough emphasis was placed on avoiding lock contention. Other parallel or distributed file systems with centralized metadata management (e.g. Google's GFS and Yahoo's HDFS) make the problems observed with GPFS even worse from a scalability perspective. Future storage systems for high-end computing should support distributed metadata management, leveraging distributed data structures tailored to this environment. These distributed data structures share some characteristics with structured distributed hash tables, offering resilience in the face of failures and high availability; however, they should also support close-to-constant-time operations and deliver the low latencies typically found in centralized metadata management (under light load).

HPC storage is not the only area that suffers from this storage bottleneck. Similar to the HPC scenario, cloud-based distributed systems also face it. Furthermore, due to the dynamic nature of cloud applications, a suitable storage system must satisfy additional requirements, such as handling nodes that join and leave on the fly and offering the flexibility to run on different cloud instance types simultaneously.

As an initial attempt to meet these needs, we propose and build ZHT (a zero-hop distributed hash table [10, 11, 12, 13]), an instance of a NoSQL database [14]. ZHT has been tuned for the specific requirements of high-end computing (e.g. trustworthy/reliable hardware, fast networks, nonexistent "churn", low latencies). ZHT aims to be a building block for future distributed systems.
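Since ZHT's defining features are zero-hop key-to-node resolution and an extended operation set (insert/lookup/remove plus append, compare-and-swap, and callback), a small sketch may help make them concrete. The following is a toy, single-process model under assumed names (ZeroHopStore, _home_nodes, compare_swap, and so on); it is not ZHT's actual API, and it omits the network layer, persistence, dynamic membership, and the callback operation.

```python
"""Minimal sketch of zero-hop key partitioning and a ZHT-style operation set.
All names here are illustrative assumptions, not ZHT's real (C++) interface."""
import hashlib
from typing import Optional


class ZeroHopStore:
    """Toy in-process model: every client knows the full membership list,
    so a key resolves to its home node in O(1), with no routing hops."""

    def __init__(self, nodes: list[str], replication: int = 3):
        self.nodes = nodes                                # static membership (no churn assumed)
        self.replication = min(replication, len(nodes))
        self.partitions = {n: {} for n in nodes}          # one dict stands in for each node's local store

    def _home_nodes(self, key: str) -> list[str]:
        """Hash the key straight to a node index, then take the next
        replication-1 nodes in the ring as replicas."""
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        start = h % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replication)]

    def insert(self, key: str, value: str) -> None:
        for node in self._home_nodes(key):
            self.partitions[node][key] = value

    def lookup(self, key: str) -> Optional[str]:
        # Read from the primary replica; fall back to the others on a miss.
        for node in self._home_nodes(key):
            if key in self.partitions[node]:
                return self.partitions[node][key]
        return None

    def remove(self, key: str) -> None:
        for node in self._home_nodes(key):
            self.partitions[node].pop(key, None)

    def append(self, key: str, suffix: str) -> None:
        """Append to the stored value; in a real client/server store this
        saves a read-modify-write round trip."""
        for node in self._home_nodes(key):
            self.partitions[node][key] = self.partitions[node].get(key, "") + suffix

    def compare_swap(self, key: str, expected: Optional[str], new: str) -> bool:
        """Replace the value only if the primary's current value matches
        `expected` (no real atomicity is needed in this single-threaded toy)."""
        nodes = self._home_nodes(key)
        if self.partitions[nodes[0]].get(key) != expected:
            return False
        for node in nodes:
            self.partitions[node][key] = new
        return True


if __name__ == "__main__":
    # Hypothetical usage with file-path keys, loosely in the spirit of a
    # FusionFS-style metadata workload.
    store = ZeroHopStore(nodes=[f"node{i}" for i in range(8)])
    store.insert("/fusionfs/foo.txt", "inode=42")
    assert store.lookup("/fusionfs/foo.txt") == "inode=42"
    store.append("/fusionfs/foo.txt", ";replica=node3")
    assert store.compare_swap("/fusionfs/foo.txt",
                              "inode=42;replica=node3", "inode=43")
    store.remove("/fusionfs/foo.txt")
    assert store.lookup("/fusionfs/foo.txt") is None
```

Because every client holds the full membership table and churn is assumed to be negligible, resolving a key costs a single hash plus one direct request to its home node; this is what gives the design its zero-hop, close-to-constant-time character.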
doi:10.1002/cpe.3614