
Precision-aware soft error protection for GPUs

David J. Palframan, Nam Sung Kim, Mikko H. Lipasti
2014 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)  
We therefore propose a novel precision-aware protection approach for the GPU execution logic and register file to mitigate large magnitude errors.  ...  With the advent of general-purpose GPU computing, it is becoming increasingly desirable to protect GPUs from soft errors.  ...  Section 2 discusses prior proposals for soft error mitigation and motivates precision-aware protection.  ... 
doi:10.1109/hpca.2014.6835966 dblp:conf/hpca/PalframanKL14 fatcat:uiipqczf2rb4zlrvlctmolsl6a

2018 Index IEEE Transactions on Computers Vol. 67

2019 IEEE transactions on computers  
., +, TC July 2018 1039-1045 Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs.  ...  ., +, TC Dec. 2018 1703-1719 Optimization A GPU-Aware Parallel Index for Processing High-Dimensional Big Data.  ... 
doi:10.1109/tc.2018.2882120 fatcat:j2j7yw42hnghjoik2ghvqab6ti

The Visual Vulnerability Spectrum: Characterizing Architectural Vulnerability for Graphics Hardware [article]

Jeremy W. Sheaffer, David P. Luebke, Kevin Skadron
2006 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04  
Current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation.  ...  With this analysis in hand, we suggest several targeted, inexpensive solutions that can mitigate the most egregious of soft error consequences.  ...  We would like to extend our sincere thanks to the anonymous reviewers for their detailed and helpful comments.  ... 
doi:10.2312/eggh/eggh06/009-016 fatcat:itfqaqqprjbmrperxlhvdfttyy

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, Luigi Carro, Arthur Bland
2015 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)  
We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system.  ...  Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use  ...  Battelle, LLC for the U.S.  ... 
doi:10.1109/hpca.2015.7056044 dblp:conf/hpca/TiwariGRMRVOLDN15 fatcat:smw3cz64rfcxtouqu4z3sqqz3y

Winograd Convolution: A Perspective from Fault Tolerance [article]

Xinghua Xue, Haitong Huang, Cheng Liu, Ying Wang, Tao Luo, Lei Zhang
2022 arXiv   pre-print
Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing.  ...  According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49% or energy consumption by 7.19% without any accuracy loss compared to that without being aware  ...  fault tolerance of DNNs for either soft error mitigation or computing energy reduction.  ... 
arXiv:2202.08675v1 fatcat:clnipyq3sbbstkez2kbkdrmtlq

The visual vulnerability spectrum

Jeremy W. Sheaffer, David P. Luebke, Kevin Skadron
2006 Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware - GH '06  
Current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation.  ...  With this analysis in hand, we suggest several targeted, inexpensive solutions that can mitigate the most egregious of soft error consequences.  ...  Future designs must be more aware of such low-level physical challenges. A transient, single bit corruption in a microelectronic circuit is termed a soft error.  ... 
doi:10.1145/1283900.1283902 fatcat:oonrcxopyvcd3f4sfweczzf7ia

Application-Based Fault Tolerance Techniques for Fully Protecting Sparse Matrix Solvers

Grzegorz Pawelczak, Simon McIntosh-Smith, James Price, Matt Martineau
2017 2017 IEEE International Conference on Cluster Computing (CLUSTER)  
ACKNOWLEDGMENTS The authors would like to thank EPSRC for funding this research.  ...  We also extend thanks to the Intel Parallel Computing Centre at the University of Bristol, for providing access to the Zoo testbed, and to GW4 for providing access to their Tier 2 Isambard supercomputer  ...  Fig. 9. Runtime overheads for the ABFT techniques for protecting the dense double precision floating point vectors.  ... 
doi:10.1109/cluster.2017.49 dblp:conf/cluster/PawelczakMPM17 fatcat:sl67izpvmffipczzwpcppe5vl4

A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors [article]

Jeremy W. Sheaffer, David P. Luebke, Kevin Skadron
2007 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04  
We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures.  ...  Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results.  ...  'Transient fault' and 'transient error' are more general terms that include soft errors. Not all errors are cause for concern.  ... 
doi:10.2312/eggh/eggh07/055-064 fatcat:5rjsjfzxrvc3nas2fz3w5yn5oy

Soft Error Resilience of Deep Residual Networks for Object Recognition

Y. Ibrahim, H.-B. Wang, M. Bai, Z. Liu, J.-A. Wang, Z.-M. Yang, Z.-M. Chen
2020 IEEE Access  
GPUs have proven to be the major accelerator for CNN models. However, modern GPUs are prone to radiation-induced soft errors, which is a serious issue in safety-compliant systems.  ...  INDEX TERMS Convolutional neural networks, residual networks, safety-critical systems, GPUs, reliability, soft error, selective hardening.  ...  Section III provides a brief background on ResNets, GPUs and the mechanism of soft errors in GPUs. Section IV describes our experimental setup.  ... 
doi:10.1109/access.2020.2968129 fatcat:qsni4ga5ojbydo36nnicw2b6b4

Fine-grained bit-flip protection for relaxation methods

Hartwig Anzt, Jack Dongarra, Enrique S. Quintana-Ortí
2016 Journal of Computational Science  
As part of a push towards a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods.  ...  Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant  ...  The fault-tolerant variant FTJacobi integrates the soft-error protection defined by (5)-(6). An implementation for the bit-flip protection is given in Figure 1.  ... 
doi:10.1016/j.jocs.2016.11.013 fatcat:5czmzc66yja7rfsiv6d7xdagt4

Fault-Aware Design and Training to Enhance DNNs Reliability with Zero-Overhead [article]

Niccolò Cavagnero, Fernando Dos Santos, Marco Ciccone, Giuseppe Averta, Tatiana Tommasi, Paolo Rech
2022 arXiv   pre-print
For instance, the radiation-induced misprediction probability can be so high to impede a safe deployment of DNNs models at scale, urging the need for efficient and effective hardening solutions.  ...  by soft errors induced by ionising particles strikes.  ...  Despite the low error rate per device (in the order of one error every 3-4 years, considering a natural flux of 13 neutrons/cm²/h [4], for modern GPUs [3], [5]), the foreseen large-scale adoption  ... 
arXiv:2205.14420v1 fatcat:pbr3dm6y2bhwpc4gvjfcun4w7q

High Performance Dense Linear System Solver with Soft Error Resilience

Peng Du, Piotr Luszczek, Jack Dongarra
2011 2011 IEEE International Conference on Cluster Computing  
checkpointing algorithm to protect the left factor that is needed for recovering x from soft error.  ...  error at all due to error propagation and lack of error awareness.  ...  Lately, iterative solvers were evaluated for soft error vulnerability [22], [23] for sparse matrix system, and this shows the recent awareness of soft error for solving large scale problem.  ... 
doi:10.1109/cluster.2011.38 dblp:conf/cluster/DuLD11 fatcat:evh4vbkl6bfk7a4numnaokb6pm

Towards a Safety Case for Hardware Fault Tolerance in Convolutional Neural Networks Using Activation Range Supervision [article]

Florian Geissler, Syed Qutub, Sayanta Roychowdhury, Ali Asgari, Yang Peng, Akash Dhamasia, Ralf Graefe, Karthik Pattabiraman, Michael Paulitsch
2021 arXiv   pre-print
Real-world implementations will need to guarantee their robustness against hardware soft errors corrupting the underlying platform memory.  ...  Based on the previously observed efficacy of activation clipping techniques, we build a prototypical safety case for classifier CNNs by demonstrating that range supervision represents a highly reliable  ...  Parity or error-correcting code (ECC) can protect memory elements against single soft errors [5, 13] .  ... 
arXiv:2108.07019v1 fatcat:7e66xtrwd5dqfkiw2dw5vfu72y

2020-2021 Index IEEE Transactions on Computers Vol. 70

2021 IEEE transactions on computers  
The Author Index contains the primary entry for each item, listed under the first author's name.  ...  ., +, TC Sept. 2021 1388-1400 Soft Error Tolerant Count Min Sketches.  ...  Zhao, S., +, TC July 2021 1006-1018 Soft Error Tolerant Count Min Sketches.  ... 
doi:10.1109/tc.2021.3134810 fatcat:p5otlsapynbwvjmqogj47kv5qa

Autotuning GEMM Kernels for the Fermi GPU

Jakub Kurzak, Stanimire Tomov, Jack Dongarra
2012 IEEE Transactions on Parallel and Distributed Systems  
arithmetic and memory protected with error correction codes.  ...  This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code  ...  Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems," Georgia Institute of Technology subcontract #RA241-G1 funded by NSF grant #OCI-0910735, "Keeneland: National Institute  ... 
doi:10.1109/tpds.2011.311 fatcat:nc7hsw2vhfgyvjio6vjo2mtpca