Constructing collaborative desktop storage caches for large scientific datasets
ACM Transactions on Storage
Sudharshan Vazhkudai et al.
High-end computing is suffering a data deluge from experiments, simulations, and apparatus that creates overwhelming application dataset sizes. This has led to the proliferation of high-end mass storage systems, storage area clusters, and data centers. These storage facilities offer a large range of choices in terms of capacity and access rate, as well as strong data availability and consistency support. However, for most end-users, the "last mile" in their
... pipeline often requires data processing and visualization at local computers, typically local desktop workstations. End-user workstations, despite having more processing power than ever before, are ill-equipped to cope with such data demands due to insufficient secondary storage space and I/O rates. Meanwhile, a large portion of desktop storage is unused. We propose the FreeLoader framework, which aggregates unused desktop storage space and I/O bandwidth into a shared cache/scratch space for hosting large, immutable datasets and exploiting data access locality. This paper presents the FreeLoader architecture, component design, and performance results based on our proof-of-concept prototype. Its architecture comprises contributing benefactor nodes, steered by a management layer, providing services such as data integrity, high performance, load balancing, and impact control. Our experiments show that FreeLoader is an appealing low-cost solution for storing massive datasets, delivering higher data access rates than traditional storage facilities, namely local or remote shared file systems, storage systems, and Internet data repositories. In particular, we present novel data striping techniques that allow FreeLoader to efficiently aggregate a workstation's network communication bandwidth and local I/O bandwidth. In addition, the performance impact on the native workload of donor machines is small and can be effectively controlled. Further, we show that security features such as data encryption and integrity checks can be easily added as filters for interested clients. Finally, we demonstrate how legacy applications can use the FreeLoader API to store and retrieve datasets.
... provide convenience in connecting users' computing/visualization tasks with other tools used daily in their work and collaboration, such as editors, spreadsheet tools, web browsers, multimedia players, and visual conference tools.
Finally, compared to high-end computing systems that are often built to last for years, desktop workstations at research institutions are upgraded more often and typically have higher compute power than individual nodes of a large parallel system. This is especially advantageous for running sequential programs, and many essential scientific computing tools are not parallel. Applications that were once beyond the capability of a single workstation are now routinely executed on personal desktop computers. The combination of fast CPUs, large memory, and the prospering Linux environment provides scientists with a familiar yet powerful computing platform right in their offices, often enabling them to avoid the overhead of obtaining parallel computer accounts, moving data frequently, and submitting batch jobs and waiting for them to complete.
While personal computers are well suited to their important roles in scientific workflows, with advantages in human-computer interface and processing power, storage is now usually their limiting factor. Commodity desktop computers are often equipped with limited secondary storage capacity and I/O rates. Shared storage in university departments and research labs is mostly provided for hosting ordinary documents such as email and web pages, and usually comes with small quotas, low bandwidth, and heavy workloads. This imbalance between compute power and storage resources leaves scientists with two unattractive choices when processing datasets larger than their workstations' available disk space. First, their workstations can access the datasets remotely, but wide-area network latencies severely degrade performance. Second, they can use a high-performance computer, which has sufficient disk space, but they will then have to perform their computation either at a crowded head node or through a batch system. Users may also choose to install a large storage system accessible from their desktop workstations.
However, such a system is not cheap. Although disks themselves are relatively affordable today (at $1000 to $2000 for 1 TB), building a storage system requires expensive hardware such as Fibre Channel switches. For example, a 365 GB disk array currently costs over $6000 and a 4 TB array costs over $40,000 (price quote from www.ibm.com as of 2005), which is a non-trivial expense, especially for academic and government research environments. These figures do not yet include maintenance costs. Although the price for a given capacity is expected to fall, data sizes are expected to rise, often at a faster rate. In fact, parallel simulations can already easily generate terabytes of data per application per day [Bair et al. 2004]. When groups of users store their scientific datasets in a shared storage system, even a large space can quickly be exhausted, as demonstrated by shared scratch file systems at supercomputer centers. Further, even when workstation-attached storage is abundant, users normally choose not to retain copies of downloaded scientific datasets on their desktops beyond the processing duration. These datasets are on the order of several GB or larger and are usually archived in mass storage systems, file servers, etc. Subsequent requests for these datasets involve data migration from archival systems at transfer rates significantly lower than local I/O or LAN throughput [Lee et al. 2002 ; Lee