Precision-aware soft error protection for GPUs
2014
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
We therefore propose a novel precision-aware protection approach for the GPU execution logic and register file to mitigate large magnitude errors. ...
With the advent of general-purpose GPU computing, it is becoming increasingly desirable to protect GPUs from soft errors. ...
Section 2 discusses prior proposals for soft error mitigation and motivates precision-aware protection. ...
doi:10.1109/hpca.2014.6835966
dblp:conf/hpca/PalframanKL14
fatcat:uiipqczf2rb4zlrvlctmolsl6a
2018 Index IEEE Transactions on Computers Vol. 67
2019
IEEE transactions on computers
., +, TC July 2018 1039-1045 Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs. ...
., +, TC Dec. 2018 1703-1719 Optimization A GPU-Aware Parallel Index for Processing High-Dimensional Big Data. ...
doi:10.1109/tc.2018.2882120
fatcat:j2j7yw42hnghjoik2ghvqab6ti
The Visual Vulnerability Spectrum: Characterizing Architectural Vulnerability for Graphics Hardware
[article]
2006
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04
Current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation. ...
With this analysis in hand, we suggest several targeted, inexpensive solutions that can mitigate the most egregious of soft error consequences. ...
We would like to extend our sincere thanks to the anonymous reviewers for their detailed and helpful comments. ...
doi:10.2312/eggh/eggh06/009-016
fatcat:itfqaqqprjbmrperxlhvdfttyy
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
2015
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. ...
Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use ...
Battelle, LLC for the U.S. ...
doi:10.1109/hpca.2015.7056044
dblp:conf/hpca/TiwariGRMRVOLDN15
fatcat:smw3cz64rfcxtouqu4z3sqqz3y
Winograd Convolution: A Perspective from Fault Tolerance
[article]
2022
arXiv
pre-print
Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing. ...
According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49% or energy consumption by 7.19% without any accuracy loss compared to that without being aware ...
fault tolerance of DNNs for either soft error mitigation or computing energy reduction. ...
arXiv:2202.08675v1
fatcat:clnipyq3sbbstkez2kbkdrmtlq
The visual vulnerability spectrum
2006
Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware - GH '06
Current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation. ...
With this analysis in hand, we suggest several targeted, inexpensive solutions that can mitigate the most egregious of soft error consequences. ...
Future designs must be more aware of such low-level physical challenges. A transient, single bit corruption in a microelectronic circuit is termed a soft error. ...
doi:10.1145/1283900.1283902
fatcat:oonrcxopyvcd3f4sfweczzf7ia
Application-Based Fault Tolerance Techniques for Fully Protecting Sparse Matrix Solvers
2017
2017 IEEE International Conference on Cluster Computing (CLUSTER)
ACKNOWLEDGMENTS The authors would like to thank EPSRC for funding this research. ...
We also extend thanks to the Intel Parallel Computing Centre at the University of Bristol, for providing access to the Zoo testbed, and to GW4 for providing access to their Tier 2 Isambard supercomputer ...
Fig. 9. Runtime overheads for the ABFT techniques for protecting the dense double precision floating point vectors. ...
doi:10.1109/cluster.2017.49
dblp:conf/cluster/PawelczakMPM17
fatcat:sl67izpvmffipczzwpcppe5vl4
A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors
[article]
2007
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04
We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. ...
Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. ...
'Transient fault' and 'transient error' are more general terms that include soft errors. Not all errors are cause for concern. ...
doi:10.2312/eggh/eggh07/055-064
fatcat:5rjsjfzxrvc3nas2fz3w5yn5oy
Soft Error Resilience of Deep Residual Networks for Object Recognition
2020
IEEE Access
GPUs have proven to be the major accelerator for CNN models. However, modern GPUs are prone to radiation-induced soft errors, which is a serious issue in safety-compliant systems. ...
INDEX TERMS Convolutional neural networks, residual networks, safety-critical systems, GPUs, reliability, soft error, selective hardening. ...
Section III provides a brief background on ResNets, GPUs and the mechanism of soft errors in GPUs. Section IV describes our experimental setup. ...
doi:10.1109/access.2020.2968129
fatcat:qsni4ga5ojbydo36nnicw2b6b4
Fine-grained bit-flip protection for relaxation methods
2016
Journal of Computational Science
As part of a push towards a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. ...
Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant ...
The fault-tolerant variant FTJacobi integrates the soft-error protection defined by (5)-(6). An implementation for the bit-flip protection is given in Figure 1. ...
doi:10.1016/j.jocs.2016.11.013
fatcat:5czmzc66yja7rfsiv6d7xdagt4
Fault-Aware Design and Training to Enhance DNNs Reliability with Zero-Overhead
[article]
2022
arXiv
pre-print
For instance, the radiation-induced misprediction probability can be so high as to impede a safe deployment of DNN models at scale, urging the need for efficient and effective hardening solutions. ...
by soft errors induced by ionising particle strikes. ...
Despite the low error rate per device (in the order of one error every 3-4 years, considering a natural flux of 13 neutrons/cm²/h [4], for modern GPUs [3], [5]), the foreseen large-scale adoption ...
arXiv:2205.14420v1
fatcat:pbr3dm6y2bhwpc4gvjfcun4w7q
High Performance Dense Linear System Solver with Soft Error Resilience
2011
2011 IEEE International Conference on Cluster Computing
checkpointing algorithm to protect the left factor that is needed for recovering x from soft error. ...
error at all due to error propagation and lack of error awareness. ...
Lately, iterative solvers were evaluated for soft error vulnerability [22], [23] for sparse matrix system, and this shows the recent awareness of soft error for solving large scale problem. ...
doi:10.1109/cluster.2011.38
dblp:conf/cluster/DuLD11
fatcat:evh4vbkl6bfk7a4numnaokb6pm
Towards a Safety Case for Hardware Fault Tolerance in Convolutional Neural Networks Using Activation Range Supervision
[article]
2021
arXiv
pre-print
Real-world implementations will need to guarantee their robustness against hardware soft errors corrupting the underlying platform memory. ...
Based on the previously observed efficacy of activation clipping techniques, we build a prototypical safety case for classifier CNNs by demonstrating that range supervision represents a highly reliable ...
Parity or error-correcting code (ECC) can protect memory elements against single soft errors [5, 13]. ...
arXiv:2108.07019v1
fatcat:7e66xtrwd5dqfkiw2dw5vfu72y
2020-2021 Index IEEE Transactions on Computers Vol. 70
2021
IEEE transactions on computers
The Author Index contains the primary entry for each item, listed under the first author's name. ...
., +, TC Sept. 2021 1388-1400 Soft Error Tolerant Count Min Sketches. ...
Zhao, S., +, TC July 2021 1006-1018 Soft Error Tolerant Count Min Sketches. ...
doi:10.1109/tc.2021.3134810
fatcat:p5otlsapynbwvjmqogj47kv5qa
Autotuning GEMM Kernels for the Fermi GPU
2012
IEEE Transactions on Parallel and Distributed Systems
arithmetic and memory protected with error correction codes. ...
This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code ...
Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems," Georgia Institute of Technology subcontract #RA241-G1 funded by NSF grant #OCI-0910735, "Keeneland: National Institute ...
doi:10.1109/tpds.2011.311
fatcat:nc7hsw2vhfgyvjio6vjo2mtpca