Understanding the propagation of hard errors to software and implications for resilient system design

Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
2008 SIGPLAN notices  
We focus on hard faults because they are increasingly important and have different system implications than the much studied transients.  ...  We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults.  ...  Acknowledgments We would like to thank Pradip Bose from IBM and Subhasish Mitra from Stanford University for many discussions on this work and insightful comments on previous versions of this paper.  ... 
doi:10.1145/1353536.1346315 fatcat:l2ipmrguofa7nmnatgraio5vau

Understanding the propagation of hard errors to software and implications for resilient system design

Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
2008 ACM SIGOPS Operating Systems Review  
We focus on hard faults because they are increasingly important and have different system implications than the much studied transients.  ...  We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults.  ...  Acknowledgments We would like to thank Pradip Bose from IBM and Subhasish Mitra from Stanford University for many discussions on this work and insightful comments on previous versions of this paper.  ... 
doi:10.1145/1353535.1346315 fatcat:ca6pbnf535a7hjpenaxuyqzr3a

Understanding the propagation of hard errors to software and implications for resilient system design

Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
2008 SIGARCH Computer Architecture News  
We focus on hard faults because they are increasingly important and have different system implications than the much studied transients.  ...  We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults.  ...  Acknowledgments We would like to thank Pradip Bose from IBM and Subhasish Mitra from Stanford University for many discussions on this work and insightful comments on previous versions of this paper.  ... 
doi:10.1145/1353534.1346315 fatcat:pcsal3m4vfd2xkliow3ark7iva

Understanding the propagation of hard errors to software and implications for resilient system design

Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
2008 Proceedings of the 13th international conference on Architectural support for programming languages and operating systems - ASPLOS XIII  
We focus on hard faults because they are increasingly important and have different system implications than the much studied transients.  ...  We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults.  ...  Acknowledgments We would like to thank Pradip Bose from IBM and Subhasish Mitra from Stanford University for many discussions on this work and insightful comments on previous versions of this paper.  ... 
doi:10.1145/1346281.1346315 dblp:conf/asplos/LiRSAAZ08 fatcat:t6sgrwzzlzc5ncbjkfz5clgmse

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0) [article]

Saurabh Hukerikar, Christian Engelmann
2016 arXiv   pre-print
We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack.  ...  The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a  ...  used in fault tolerance and the basic concepts of resilience to enable HPC designers, as well as system operators and users to understand the essence of the resilience patterns and use them in their designs  ... 
arXiv:1611.02717v2 fatcat:sumkgkwokzaonemt6oxnyxysra

Addressing failures in exascale computing

Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus (+16 others)
2014 The international journal of high performance computing applications  
Software error rates are discussed in Section 4 in more detail. 6 Applicable Technologies The solution to the problem of resilience at exascale will require a synergistic use of multiple hardware and software  ...  Therefore, we considered the following three design points: (1) business as usual, (2) system-level resilience, and (3) application-level resilience.  ...  Department of Energy for its financial support of ICiS; the ICiS director and steering committee for the support provided to our workshop; and, in particular, Cheryl Zidel for her outstanding administrative  ... 
doi:10.1177/1094342014522573 fatcat:menonpmgdfflzamz2fsivevxqm

Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool

Dong Li, Jeffrey S. Vetter, Weikuan Yu
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
Extreme-scale scientific applications are at a significant risk of being hit by soft errors on supercomputers as the scale of these systems and the component density continues to increase.  ...  In order to better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate  ...  Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.  ... 
doi:10.1109/sc.2012.29 dblp:conf/sc/LiVY12 fatcat:c6z3n655f5cbzddmfvi5f2fsly

A Pattern Language for High-Performance Computing Resilience

Saurabh Hukerikar, Christian Engelmann
2017 Proceedings of the 22nd European Conference on Pattern Languages of Programs - EuroPLoP '17  
We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns.  ...  Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system  ...  We thank our shepherd Klaus Marquardt for his comments and suggestions that greatly improved the manuscript.  ... 
doi:10.1145/3147704.3147718 dblp:conf/europlop/HukerikarE17 fatcat:o7jhimgmevd65odewj6s7enlue

LLFI: An Intermediate Code-Level Fault Injection Tool for Hardware Faults

Qining Lu, Mostafa Farahani, Jiesheng Wei, Anna Thomas, Karthik Pattabiraman
2015 2015 IEEE International Conference on Software Quality, Reliability and Security  
However, software based error resilience techniques need configurable and accurate fault injection techniques to evaluate their effectiveness.  ...  We demonstrate the utility of LLFI by using it to perform fault injection experiments into nine programs, and study the effect of different injection choices on their resilience, namely instruction type  ...  Acknowledgments: This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Discovery Grants program, and a research gift from Lockheed Martin  ... 
doi:10.1109/qrs.2015.13 dblp:conf/qrs/LuFWTP15 fatcat:twfge2bnj5anjf27ta43l77cky

Understanding Reliability Implication of Hardware Error in Virtualization Infrastructure

Xin Xu, H. Howie Huang
2014 Hot Topics in System Dependability  
We further discuss the challenges of designing error tolerance techniques for the hypervisor.  ...  To understand reliability implication of hardware errors in virtualized systems, in this paper we develop a simulation-based framework that enables a comprehensive fault injection study on the hypervisor  ...  We thank the HotDep reviewers and Flavio Junqueira for their helpful suggestions. This work is supported by National Science Foundation grant CNS-1350766.  ... 
dblp:conf/hotdep/XuH14 fatcat:6bjmjkw2prcvhffzmfq2wbessm

Understanding soft error propagation using Efficient vulnerability-driven fault injection

Xin Xu, Man-Lap Li
2012 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)  
Moreover, we characterize different propagation behaviors of these non-derated faults and discuss the implications of designing future cross-layer solutions.  ...  To evaluate, statistical fault injection (SFI) is often used to estimate the error coverage of the underlying method.  ...  To understand how injected faults behave, we closely examine error propagation through the μOp level to the software level.  ... 
doi:10.1109/dsn.2012.6263923 dblp:conf/dsn/XuL12 fatcat:e7k7kwagszalxdalohl25ow6pu

Socio-technical Complex Systems of Systems: Can We Justifiably Trust Their Resilience? [chapter]

Luca Simoncini
2011 Lecture Notes in Computer Science  
I think these difficulties have serious implications for the builders of safety-critical systems, and for society at large.  ...  structuring actually controls interactions within and between systems, and limits error propagation in both time and space, i.e. constitutes real not just perceived or imagined boundaries  ... 
doi:10.1007/978-3-642-24541-1_35 fatcat:wh45zxkgxrcmddn6z6q72bclj4

A resilience markers framework for small teams

Dominic Furniss, Jonathan Back, Ann Blandford, Michael Hildebrandt, Helena Broberg
2011 Reliability Engineering & System Safety  
The framework presented here provides the basis for developing concrete measures for improving the resilience of organizations through training, system design, and organizational learning.  ...  Here we provide a framework for reasoning about resilience that requires representation of the level of analysis (from the individual to operational), a traceable link from abstract theory to specific  ...  Acknowledgments We would like to thank Björn Johansson, Jonas Lundberg, and Erik Prytz for their engaging discussion at the resilience workshop, Pukeberg, Sweden, which contributed to ideas in this paper  ... 
doi:10.1016/j.ress.2010.06.025 fatcat:ffcf3ekv7vb3lgowbbz5ba3msy

On Reducing Circuit Malfunctions Caused by Soft Errors

Ilia Polian, Sudhakar M. Reddy, Irith Pomeranz, Xun Tang, Bernd Becker
2008 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems  
Methods to reduce system failures due to soft errors include use of redundancy and making circuit elements robust such that soft errors do not upset signal values.  ...  Thus goals on immunity to soft errors may not be achievable in highly optimized circuits without adding circuit redundancy and/or relaxing the requirements on system failures due to soft errors.  ...  In our opinion, the key to the design of dependable systems is the understanding of critical versus non-critical errors with respect to the specification.  ... 
doi:10.1109/dft.2008.20 dblp:conf/dft/PolianRPTB08 fatcat:f6b7jwcjeffhlkkt5pjd57n2fq

Software reliability and dependability

Bev Littlewood, Lorenzo Strigini
2000 Proceedings of the conference on The future of Software engineering - ICSE '00  
in others still the need to confront inherently hard problems of prediction and decision-making, both to clarify the limits of current understanding and to push them back.  ...  His main current interest is defining practical, rigorous methods for assessing the dependability of software and other systems subject to design faults, and for supporting development decisions to achieve  ...  ACKNOWLEDGEMENTS The authors' work was supported in part by EPSRC grants GR/L07673 and GR/L57296.  ... 
doi:10.1145/336512.336551 dblp:conf/icse/LittlewoodS00b fatcat:dapf76ufkba6xgab3khod72ucu