Resilient Optimistic Termination Detection for the Async-Finish Model [chapter]

Sara S. Hamouda, Josh Milthorpe
2019 Lecture Notes in Computer Science  
Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures.  ...  reducing the overhead for failure-free execution.  ...  Resilient X10 provides user-level fault tolerance support by extending the async-finish model with failure awareness.  ... 
doi:10.1007/978-3-030-20656-7_15 fatcat:zmlbpfbcufd2nplc7plr7u6bbm

Resilient Work Stealing [article]

Pascal Costanza, Charlotte Herzeel, Wolfgang De Meuter, Roel Wuyts
2017 arXiv   pre-print
overheads in the presence of single and multiple failures.  ...  A comparison with the work-stealing scheduler of Threading Building Blocks on the PARSEC benchmark suite shows that Cobra incurs no performance overhead in the absence of failures, and low performance  ...  The result is Cobra, a novel resilience-aware work-stealing scheduler for fully strict tree-recursive fork/join computations.  ... 
arXiv:1706.03539v1 fatcat:6ra44pvelbbpfaixu2vrg5hyim

MODC: Resilience for disaggregated memory architectures using task-based programming [article]

Kimberly Keeton, Sharad Singhal, Haris Volos, Yupu Zhang, Ramesh Chandra Chaurasiya, Clarete Riana Crasta, Sherin T George, Nagaraju K N, Mashood Abdulla K, Kavitha Natarajan, Porno Shome, Sanish Suresh
2021 arXiv   pre-print
We present highlights of our MODC prototype and experimental results demonstrating that MODC-style resilience outperforms a checkpoint-based approach in the face of failures.  ...  is unaffected by the compute failure.  ...  Resilient X10 [19] proposes extensions to the X10 task-parallel language [17] to expose failures to programmers, who can then handle individual task failures by exploiting domain-specific knowledge  ... 
arXiv:2109.05329v1 fatcat:xq6o5reg3nhndmvgooht2m3d74

A Survey on Resiliency Techniques in Cloud Computing Infrastructures and Applications

Carlos Colman-Meixner, Chris Develder, Massimo Tornatore, Biswanath Mukherjee
2016 IEEE Communications Surveys and Tutorials  
One of the critical challenges is resiliency: disruptions due to failures (either accidental or because of disasters or attacks) may entail significant revenue losses (e.g., US$ 25.5 billion in 2010 for  ...  Before moving to the detailed resilience aspects, we provide a qualitative overview of the types of failures that may occur (from the perspective of the layered cloud architecture), and their consequences  ...  X10RT extends the cloud computing application programming language X10 by adding a run-time, i.e., "memory resident algorithm" [213] that detects and repairs the logic of any program (i.e., the core  ... 
doi:10.1109/comst.2016.2531104 fatcat:vzvkai7nkrbbda63fesn7zw4di

Exploring versioned distributed arrays for resilience in scientific applications

A Chien, P Balaji, N Dun, A Fang, H Fujita, K Iskra, Z Rubenstein, Z Zheng, J Hammond, I Laguna, D Richards, A Dubey (+5 others)
2016 The international journal of high performance computing applications  
Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience.  ...  We present the Global View Resilience (GVR) system, a library for portable resilience.  ...  To explore the impact of failure rate and detection latency on recovery efficiency, we use the failure model proposed by Snir et al. (2014) to manually inject errors and set versioning interval and  ... 
doi:10.1177/1094342016664796 fatcat:aaipn5vawrg4dhzka4rigj325y

A Java Task Pool Framework providing Fault-Tolerant Global Load Balancing

Jonas Posner, Claudia Fohry
2018 International Journal of Networking and Computing  
Our algorithm is shown to be correct in the sense that failures are either tolerated and the computed result is the same as in non-failure case, or the program aborts with an error message.  ...  Application-level approaches are becoming increasingly popular, since they may be more efficient.  ...  The name X10 reflects the language's goal to raise programming efficiency by a factor of 10. The X10 syntax is inspired by Java.  ... 
doi:10.15803/ijnc.8.1_2 fatcat:u23fwvr2iffkriulb7rzpi42su

M3R: Increased performance for in-memory Hadoop jobs [article]

Avraham Shinnar, David Cunningham, Benjamin Herta, Vijay Saraswat
2012 arXiv   pre-print
It does not support resilience, and supports only those workloads which can fit into cluster memory.  ...  Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters.  ...  The job controller itself is a single point of failure, but known techniques can be applied to make it resilient.  ... 
arXiv:1208.4168v1 fatcat:lnsnqc2ak5adblsp4ojzctbfb4

M3R

Avraham Shinnar, David Cunningham, Vijay Saraswat, Benjamin Herta
2012 Proceedings of the VLDB Endowment  
It does not support resilience, and supports only those workloads which can fit into cluster memory.  ...  Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters.  ...  The job controller itself is a single point of failure, but known techniques can be applied to make it resilient.  ... 
doi:10.14778/2367502.2367513 fatcat:bnh6bmmorfdstgcbmczabbo5tu

A Taxonomy Of Task-Based Technologies For High-Performance Computing

Peter Thoman, Khalid Hasanov, Kiril Dichev, Roman Iakymchuk, Xavier Aguilar, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Erwin Laure, Kostas Katrinis, Dimitrios S. Nikolopoulos (+1 others)
2017 Zenodo  
Task-based programming models for shared memory -- such as Cilk Plus and OpenMP 3 -- are well established and documented.  ...  In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.  ...  The Cilk language [19] allows task-focused parallel programming, and is an early example of efficient task scheduling via work stealing.  ... 
doi:10.5281/zenodo.1162306 fatcat:7d7lu2l6kfc3necv3pdien6xc4

A Taxonomy Of Task-Based Technologies For High-Performance Computing

Peter Thoman, Khalid Hasanov, Kiril Dichev, Roman Iakymchuk, Xavier Aguilar, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Erwin Laure, Kostas Katrinis, Dimitrios S. Nikolopoulos (+1 others)
2017 Zenodo  
Task-based programming models for shared memory -- such as Cilk Plus and OpenMP 3 -- are well established and documented.  ...  In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.  ...  The Cilk language [19] allows task-focused parallel programming, and is an early example of efficient task scheduling via work stealing.  ... 
doi:10.5281/zenodo.1155586 fatcat:3t27vjucovcxvmjsiowqhkdtrq

A Taxonomy Of Task-Based Parallel Programming Technologies For High-Performance Computing

Peter Thoman, Kiril Dichev, Khalid Hasanov, Roman Iakymchuk, Xavier Aguilar, Thomas Heller, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Thomas Fahringer, Kostas Katrinis (+2 others)
2017 Zenodo  
Task-based programming models for shared memory -- such as Cilk Plus and OpenMP 3 -- are well established and documented.  ...  In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.  ...  The Cilk language [5] allows task-focused parallel programming, and is an early example of efficient task scheduling via work stealing.  ... 
doi:10.5281/zenodo.1119094 fatcat:kbuhio5hu5bs7kqkuj5s4jijdi

Resilience in high-level parallel programming languages [article]

Sara S. Hamouda, The Australian National University
2019
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting task parallelism and resilience a necessity in HPC programming models.  ...  Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model --- async-finish --- and by providing structured failure reporting through  ...  Therefore, while Charm++ programs are oblivious to failures, Erlang programs are aware of failures because failure handling is defined at the user level.  ... 
doi:10.25911/5d0cb264c1c22 fatcat:i75kgocqcfbmxgd4cb723tvgoi

Exascale Machines Require New Programming Paradigms and Runtimes

2015 Supercomputing Frontiers and Innovations  
Furthermore, existing programming models already require heroic programming and optimization efforts to achieve high efficiency on current supercomputers.  ...  We propose and discuss important features of programming paradigms and runtimes to deal with exascale computing systems with a special focus on data-intensive applications and resilience.  ...  Resilience As exascale systems grow in computational power and scale, failure rates inevitably increase.  ... 
doi:10.14529/jsfi150201 fatcat:ozj4czefxrd37j7djcxuukyuee

The role of concurrency in an evolutionary view of programming abstractions [article]

Silvia Crafa
2015 arXiv   pre-print
In this paper we examine how concurrency has been embodied in mainstream programming languages.  ...  This paper is not meant to be a survey of modern mainstream programming languages: it would be very incomplete in that sense.  ...  Reactivity to failures asks for programming styles that enforce application resilience, in order to quickly recover from software failures, hardware failures, and communication failures.  ... 
arXiv:1507.07719v1 fatcat:zi5vcohn6rfyhpcnw5kctaenru

Predictive Reliability and Fault Management in Exascale Systems

Ramon Canal, Carles Hernandez, Rafa Tornero, Alessandro Cilardo, Giuseppe Massari, Federico Reghenzani, William Fornaciari, Marina Zapater, David Atienza, Ariel Oleksiak, Wojciech Piątek, Jaume Abella
2020 ACM Computing Surveys  
Programming Models and Runtime Managers Several programming models include now resilience support.  ...  for exascale systems should expressly support the selection of the optimal checkpointing strategy, depending on each application's execution characteristics, as well as scheduling decisions that are resilience-aware  ... 
doi:10.1145/3403956 fatcat:77xcpnevmnc5jfpj6ynhwdng3m