Resilient X10

David Cunningham, David Grove, Benjamin Herta, Arun Iyengar, Kiyokuni Kawachiya, Hiroki Murata, Vijay Saraswat, Mikio Takeuchi, Olivier Tardieu
2014 SIGPLAN notices  
Existing exception semantics give strong synchronization guarantees. Performance is within 90% of non-resilient X10. Kernel found in a number of algorithms, e.g.  ...  Failure awareness. Provide helpful semantics: failure reporting; continuing execution on unaffected nodes; preservation of synchronization (HBI principle).  ...  Implementing Resilient X10 (X10RT): external Paxos group of processes; lightweight resilient store; still too much overhead (details in paper).  ... 
doi:10.1145/2692916.2555248 fatcat:up5khhdg2rahdnne3zwcw752qe

Resilient X10

David Cunningham, David Grove, Benjamin Herta, Arun Iyengar, Kiyokuni Kawachiya, Hiroki Murata, Vijay Saraswat, Mikio Takeuchi, Olivier Tardieu
2014 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14  
Existing exception semantics give strong synchronization guarantees. Performance is within 90% of non-resilient X10. Kernel found in a number of algorithms, e.g.  ...  Failure awareness. Provide helpful semantics: failure reporting; continuing execution on unaffected nodes; preservation of synchronization (HBI principle).  ...  Implementing Resilient X10 (X10RT): external Paxos group of processes; lightweight resilient store; still too much overhead (details in paper).  ... 
doi:10.1145/2555243.2555248 dblp:conf/ppopp/CunninghamGHIKMSTT14 fatcat:zumxyvkjhneztervvth7hjokou

Resilient Optimistic Termination Detection for the Async-Finish Model [chapter]

Sara S. Hamouda, Josh Milthorpe
2019 Lecture Notes in Computer Science  
Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures.  ...  In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model.  ...  LULESH X10 provides a resilient implementation of the LULESH shock hydrodynamics proxy application [8] based on rollback-recovery.  ... 
doi:10.1007/978-3-030-20656-7_15 fatcat:zmlbpfbcufd2nplc7plr7u6bbm

Data-Driven Maintenance Priority and Resilience Evaluation of Performance Loss in a Main Coolant System

Hongyan Dui, Zhe Xu, Liwei Chen, Liudong Xing, Bin Liu
2022 Mathematics  
Based on the LIM, RIMs for single component failure and multiple component failures were developed to measure the recovery efficiency of the system performance.  ...  In this paper, a resilience importance measure (RIM) for performance loss is proposed to evaluate the performance of the MCS.  ...  [9] established a resilience assessment model by quantifying the relationship between resilience and resilience components in the recovery from emergency accidents in NPPs.  ... 
doi:10.3390/math10040563 fatcat:r6toq62xujedxfbkyeg6rponam

A Java Task Pool Framework providing Fault-Tolerant Global Load Balancing

Jonas Posner, Claudia Fohry
2018 International Journal of Networking and Computing  
Our algorithm is shown to be correct in the sense that failures are either tolerated and the computed result is the same as in non-failure case, or the program aborts with an error message.  ...  It implements a comparatively simple algorithm that relies on a resilient data structure for storing backups of local pools and other information.  ...  Recovery is explained in detail in Section 3.3 for the single-failure case, and in Section 3.4 for the multiple-failure case.  ... 
doi:10.15803/ijnc.8.1_2 fatcat:u23fwvr2iffkriulb7rzpi42su

Semantics of (Resilient) X10 [chapter]

Silvia Crafa, David Cunningham, Vijay Saraswat, Avraham Shinnar, Olivier Tardieu
2014 Lecture Notes in Computer Science  
These principles permit an X10 programmer to write clean code that continues to work in the presence of place failure. The given semantics have additionally been mechanized in Coq.  ...  This model accurately captures the behavior of a large class of concurrent, multi-place X10 programs. Further, we introduce a formal model of resilience in X10.  ...  The failure of a location can be detected, allowing failure recovery.  ... 
doi:10.1007/978-3-662-44202-9_27 fatcat:xgnhfeklmnauhldjkzmfzvf6hu

Semantics of (Resilient) X10 [article]

Silvia Crafa, David Cunningham, Vijay Saraswat, Avraham Shinnar, Olivier Tardieu
2013 arXiv   pre-print
This model accurately captures the behavior of a large class of concurrent, multi-place X10 programs. Further, we introduce a formal model of resilience in X10.  ...  These principles permit an X10 programmer to write clean code that continues to work in the presence of place failure. The given semantics have additionally been mechanized in Coq.  ...  The failure of a location can be detected, allowing failure recovery.  ... 
arXiv:1312.3739v1 fatcat:tqyjnwb7gjfihcnknzindg226u

Fault Tolerance for Lifeline-Based Global Load Balancing

Claudia Fohry, Marco Bungart, Paul Plock
2017 Journal of Software Engineering and Applications  
Our algorithm is able to recover from multiple fail-stop failures. If recovery is not possible, it halts with an error message.  ...  After failures, the backup partner takes over saved copies and collects others. In case of multiple failures, invocations of the restore protocol are nested.  ...  X10 supports a mode called Resilient X10, in which the user program is notified in the event of a permanent place failure.  ... 
doi:10.4236/jsea.2017.1013053 fatcat:s5m4ebb3afafphtkooebsm7xxi

X10-FT

Chenning Xie, Zhijun Hao, Haibo Chen
2013 Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '13  
We also provide a preliminary evaluation to show the cost of providing fault-tolerance in X10-FT.  ...  based on the characteristics of the APGAS model to make checkpoints and consensus, which allows transparently handling machine failures in different granularities.  ...  Science and Technology Development Funds (No. 12QA1401700), a Foundation for the Author of National Excellent Doctoral Dissertation of PR China and Fundamental Research Funds for the Central Universities in  ... 
doi:10.1145/2442992.2442994 dblp:conf/ppopp/XieHC13 fatcat:wsntxv2rgjdupp6p3f7oceewk4

A Survey on Resiliency Techniques in Cloud Computing Infrastructures and Applications

Carlos Colman-Meixner, Chris Develder, Massimo Tornatore, Biswanath Mukherjee
2016 IEEE Communications Surveys and Tutorials  
One of the critical challenges is resiliency: disruptions due to failures (either accidental or because of disasters or attacks) may entail significant revenue losses (e.g., US$ 25.5 billion in 2010 for  ...  ., also including resilience of the middleware infrastructure). The third part focuses on resilience in application design and development.  ...  object) written in X10 or in a multi-purpose language.  ... 
doi:10.1109/comst.2016.2531104 fatcat:vzvkai7nkrbbda63fesn7zw4di

Fault Tolerance Schemes for Global Load Balancing in X10

Claudia Fohry, Marco Bungart, Jonas Posner
2015 Scalable Computing : Practice and Experience  
X10 and Resilient X10. X10 is a novel parallel language from IBM [3], which supports object orientation and exception handling in a similar way as Java. Following the Asynchronous PGAS (APGAS)  ...  One approach handles permanent node failures at user level. It is supported by Resilient X10, a Partitioned Global Address Space language that throws an exception when a place fails.  ...  Resilient X10 provides two mechanisms for failure notification. First, a DeadPlaceException (DPE) is raised in the event of a failure.  ... 
doi:10.12694/scpe.v16i2.1088 fatcat:kmpuxusdr5bznkanl7i6u2ubni

MODC: Resilience for disaggregated memory architectures using task-based programming [article]

Kimberly Keeton, Sharad Singhal, Haris Volos, Yupu Zhang, Ramesh Chandra Chaurasiya, Clarete Riana Crasta, Sherin T George, Nagaraju K N, Mashood Abdulla K, Kavitha Natarajan, Porno Shome, Sanish Suresh
2021 arXiv   pre-print
We present highlights of our MODC prototype and experimental results demonstrating that MODC-style resilience outperforms a checkpoint-based approach in the face of failures.  ...  They also provide an independent failure model, where computations or the compute nodes they run on may fail independently of the disaggregated memory; thus, data that's resident in the disaggregated memory  ...  Resilient X10 [19] proposes extensions to the X10 task-parallel language [17] to expose failures to programmers, who can then handle individual task failures by exploiting domain-specific knowledge  ... 
arXiv:2109.05329v1 fatcat:xq6o5reg3nhndmvgooht2m3d74

A taxonomy of task-based parallel programming technologies for high-performance computing

Peter Thoman, Kiril Dichev, Thomas Heller, Roman Iakymchuk, Xavier Aguilar, Khalid Hasanov, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Thomas Fahringer, Kostas Katrinis (+2 others)
2018 Journal of Supercomputing  
In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We  ...  However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and  ... 
doi:10.1007/s11227-018-2238-4 fatcat:fctzmtp3n5fithxfchl5rub7j4

A Taxonomy Of Task-Based Technologies For High-Performance Computing

Peter Thoman, Khalid Hasanov, Kiril Dichev, Roman Iakymchuk, Xavier Aguilar, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Erwin Laure, Kostas Katrinis, Dimitrios S. Nikolopoulos (+1 others)
2017 Zenodo  
In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.  ...  We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today. The final publication is available at Springer LNCS.  ...  In such a scenario, a process cannot detect its failure; however, in a distributed run, another process may detect the failure, and trigger a recovery strategy across all processes.  ... 
doi:10.5281/zenodo.1162306 fatcat:7d7lu2l6kfc3necv3pdien6xc4

A Taxonomy Of Task-Based Parallel Programming Technologies For High-Performance Computing

Peter Thoman, Kiril Dichev, Khalid Hasanov, Roman Iakymchuk, Xavier Aguilar, Thomas Heller, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Thomas Fahringer, Kostas Katrinis (+2 others)
2017 Zenodo  
We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.  ...  In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.  ...  In such a scenario, a process cannot detect its failure; however, in a distributed run, another process may detect the failure, and trigger a recovery strategy across all processes.  ... 
doi:10.5281/zenodo.1119094 fatcat:kbuhio5hu5bs7kqkuj5s4jijdi