A scalable double in-memory checkpoint and restart scheme towards exascale

Gengbin Zheng, Xiang Ni, Laxmikant V. Kale
2012 IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)  
We extend the in-memory checkpointing scheme to work on MPI communication layer, and demonstrate the performance on very large scale supercomputers.  ...  As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability.  ...  This paper presented several optimization techniques to a scalable double in-memory checkpoint/restart scheme to improve its scalability towards exascale.  ... 
doi:10.1109/dsnw.2012.6264677 dblp:conf/dsn/ZhengNK12 fatcat:p56cp4bohzh7jli3rrfvtkb4sy

Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment

Hui Jin, Tao Ke, Yong Chen, Xian-He Sun
2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)  
However, the overhead of checkpointing is a subject of increasing concern in recent years, especially for large-scale parallel computer systems.  ...  Experimental results show that checkpointing orchestration reduced the checkpointing cost by more than 30%.  ...  The authors would like to acknowledge Joshua Hursey of the Open MPI group at Indiana University and Samuel Lang of the PVFS2 group at Argonne National Lab for their valuable assistance in the implementation of  ... 
doi:10.1109/ccgrid.2012.61 dblp:conf/ccgrid/JinKCS12 fatcat:svw6e2m64rdbjarxerswko5woe

Macroscopic characterisations of Web accessibility

Rui Lopes, Luis Carriço
2010 New Review of Hypermedia and Multimedia  
checkpoints at a large scale.  ...  of accessibility at large scale, including: compare document collections from different years to study the evolution of the Web with respect to accessibility (lack of) compliance, in order to  ... 
doi:10.1080/13614568.2010.534185 fatcat:l43huio7hbgnnoo434byiju3vq

Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System

Jing Fu, Misun Min, Robert Latham, Christopher D. Carothers
2011 IEEE International Conference on Cluster Computing  
Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility.  ...  Our study shows that rbIO and coIO result in 100× improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using the GPFS.  ...  [7] used several file domain partitioning techniques to improve collective I/O performance on the Cray XT4 and on clusters.  ... 
doi:10.1109/cluster.2011.81 dblp:conf/cluster/FuMLC11 fatcat:bqzs6wovobgejeo7e5ybtpm2qq

On Processing Extreme Data

Dana Petcu, Gabriel Iuhasz, Daniel Pop, Domenico Talia, Jesus Carretero, Radu Prodan, Thomas Fahringer, Ivan Grasso, Ramon Doallo, Maria J. Martin, Basilio B. Fraguela, Roman Trobec (+11 others)
2016 Scalable Computing: Practice and Experience  
The necessary storage infrastructure needs to be investigated to address the concurrency needs of monitoring and event processing at large scale.  ...  These scalable tools based on novel programming paradigms should be used to design scalable codes that support the implementation of large-scale data mining applications.  ... 
doi:10.12694/scpe.v16i4.1134 fatcat:yibmtpz5szgudgocgoojodhjc4

Reliability-aware scalability models for high performance computing

Ziming Zheng, Zhiling Lan
2009 IEEE International Conference on Cluster Computing and Workshops  
The derived reliability-aware models can be used to predict application scalability in failure-present environments and evaluate fault tolerance techniques.  ...  In this study, we extend two well-known models, namely Amdahl's law and Gustafson's law, by considering the impact of failures and the effect of fault tolerance techniques on applications.  ...  The authors would like to thank Yong Chen and Hui Jin from the Scalable Computing Software Laboratory at IIT for their valuable discussions. We would like to thank Prof.  ... 
doi:10.1109/clustr.2009.5289177 dblp:conf/cluster/ZhengL09 fatcat:7cyv7iybknappksyjb5vu3cyuy

The Reliability Wall for Exascale Supercomputing

Xuejun Yang, Zhiyuan Wang, Jingling Xue, Yun Zhou
2012 IEEE Transactions on Computers  
Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures, by incorporating fault-tolerance mechanisms to improve their reliability and availability  ...  This paper introduces for the first time the concept of "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance.  ...  ACKNOWLEDGMENTS The authors thank the reviewers for their helpful comments and suggestions, which greatly improved the final version of the paper.  ... 
doi:10.1109/tc.2011.106 fatcat:s7hjkqkcfjfatpkdscxaov22hi

DEE: A Distributed Fault Tolerant Workflow Enactment Engine for Grid Computing [chapter]

Rubing Duan, Radu Prodan, Thomas Fahringer
2005 Lecture Notes in Computer Science  
It is a large and complex task to design and implement a workflow management system that supports scalable executions of large-scale scientific workflows in distributed and unstable Grid environments.  ...  DEE proposes a de-centralized architecture that simplifies and reduces the overhead for managing large workflows through partitioning, improved data locality, and reduced workflow-level checkpointing overhead  ...  This research is partially supported by the Austrian Science Fund as part of the Aurora project under contract SFBF1104 and the Austrian Federal Ministry for Education, Science and Culture as part of the  ... 
doi:10.1007/11557654_81 fatcat:xyj3h33lznfezbw6xknkxhl2w4

Reliability in grid computing systems

Christopher Dabrowski
2009 Concurrency and Computation  
The need to manage large numbers of computational, data, and network resources under conditions of scale, heterogeneity, and dynamism distinguishes grid systems from other types of distributed systems.  ...  The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments.  ...  ACKNOWLEDGEMENTS I wish to thank Matti Hiltunen of AT&T for his many insightful comments that helped improve this manuscript.  ... 
doi:10.1002/cpe.1410 fatcat:xih4uaq3unf7hcxa67ssxoh2jm

Exascale Machines Require New Programming Paradigms and Runtimes

2015 Supercomputing Frontiers and Innovations  
In this article, we explore the shortcomings of existing programming models and runtimes for large-scale computing systems.  ...  This article is structured as follows: the next section describes the requirements from the programmability point of view for extra large-scale systems such as ultrascale systems.  ...  To improve the programmability of  ...  Improved programmability for extra large-scale systems: Supercomputers have become an essential tool in numerous research areas.  ... 
doi:10.14529/jsfi150201 fatcat:ozj4czefxrd37j7djcxuukyuee

Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

Bogdan Nicolae
2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data.  ...  For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures.  ...  ACKNOWLEDGMENTS The experiments presented in this paper were carried out using the Shamrock cluster of IBM Research, Ireland and the Grid'5000/ALADDIN-G5K experimental testbed, an initiative of the French  ... 
doi:10.1109/ipdps.2013.14 dblp:conf/ipps/Nicolae13 fatcat:ciyjjh3pdraavbfonxz44y6j4e

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale [article]

Bogdan Nicolae and Adam Moody and Gregory Kosinovsky and Kathryn Mohror and Franck Cappello
2021 arXiv   pre-print
Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications.  ...  VeloC offers a simple API at user level, while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous  ...  Furthermore such techniques can take advantage of already existing replicas that are naturally produced by large-scale data-parallel training techniques.  ... 
arXiv:2103.02131v1 fatcat:53tvxe2iszde5gwkr4dy6gxeeq

Asynchronous object storage with QoS for scientific and commercial big data

Michael J. Brim, David A. Dillow, Sarp Oral, Bradley W. Settlemyer, Feiyi Wang
2013 Proceedings of the 8th Parallel Data Storage Workshop (PDSW '13)  
The architecture of the Scalable Object Store (SOS), a prototype object storage system that supports the API's facilities, is presented.  ...  Use cases from the target workload domains are used to motivate the key abstractions used in the application programming interface (API).  ...  Large-scale application checkpoint workloads consist of processes concurrently writing data to the storage system periodically and occasionally reading checkpoint data to restart after interruptions.  ... 
doi:10.1145/2538542.2538565 dblp:conf/sc/BrimDOSW13 fatcat:pknsyjf65jbs3busd7oxmxgxwq

Optimizing HPC Fault-Tolerant Environment: An Analytical Approach

Hui Jin, Yong Chen, Huaiyu Zhu, Xian-He Sun
2010 39th International Conference on Parallel Processing  
Performance scalability under failures is also studied to explore the performance improvement space for different parameters.  ...  Based on the model, we gauge three parameters, the number of compute nodes, checkpointing interval, and the number of spare nodes to conduct a comprehensive study of performance optimization under failures  ...  the contract No.  ... 
doi:10.1109/icpp.2010.80 dblp:conf/icpp/JinCZS10 fatcat:zaafiumudjdm7p4napqpthgicu

Efficient and Scalable Retrieval Techniques for Global File Properties

Dong H. Ahn, Michael J. Brim, Bronis R. de Supinski, Todd Gamblin, Gregory L. Lee, Matthew P. Legendre, Barton P. Miller, Adam Moody, Martin Schulz
2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
Even the most expensive operation, which checks global file consistency, completes in under 7 seconds at this scale, an improvement of several orders of magnitude over the traditional checksum technique  ...  Large-scale systems typically mount many different file systems with distinct performance characteristics and capacity.  ...  Additionally, we apply our techniques to three case studies and show how FGFS enables a wide range of HPC software to improve the scalability of its file I/O patterns.  ... 
doi:10.1109/ipdps.2013.49 dblp:conf/ipps/AhnBSGLLMMS13 fatcat:ddxljot3rfh6vffsnwydyihwau