Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach

Xiaowei Li, Guihai Yan, Jing Ye, Ying Wang
<span title="2018-05-24">2018</span> <i title="Springer Nature"><a target="_blank" rel="noopener" href="https://fatcat.wiki/container/ikvx2lmj7rew7jpw4lygqgjpby">Science China Information Sciences</a></i>
If your computer crashes, you can often revive it with a reboot, an empirical solution that usually turns out to be effective. The rationale behind this solution is that transient faults, whether in hardware or software, can be cleared by refreshing the machine state. Such a "silver bullet", however, could prove futile in the future, because faults, especially those residing in hardware such as Integrated Circuit (IC) chips, cannot be eliminated by a refresh. What we need is a more sophisticated mechanism to steer the system back to the right track. The "magic cure" is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and error recovery, working in a self-supportive manner. We have exploited FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery mechanisms, that addresses faults caused by progressive chip aging. Beyond fault detection, the FTOC paradigm provides attractive benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.

Citation: Li X W, Yan G H, Ye J, et al. Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach. Sci China Inf Sci, 2018, 61(11): 112102, https://doi. [...]

[...] self-diagnosis, self-repair, or "3S framework" for short. Some prototypes have been built to demonstrate how FTOC responds to in-field silicon degradation. More interestingly, we find that the 3S framework is not only a powerful backbone guiding various FTOC designs and implementations, but also has more far-reaching implications, such as maintaining graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield. We believe that these design principles will be critical for chip makers to maintain a competitive edge in the future.

This paper synthesizes our research on on-chip fault tolerance over the past decade. We revisit a series of techniques and propose the "3S" paradigm to govern these diverse techniques, which are intrinsically unified by a single purpose: hardware fault tolerance. As Dennard scaling [7] and Moore's law approach their limits, IC chips, especially large-scale ones, are prone to more reliability challenges [8, 9]. We believe the proposed FTOC paradigm provides an alternative solution to this challenge. This paper clarifies the key motivation behind FTOC and showcases some key findings, in the hope of attracting more contributions.

The rest of this paper is organized as follows. Section 2 presents the lifetime reliability pathology of IC chips, named the "Sick Silicon" problem, which is the main application field of the proposed FTOC paradigm. Section 3 presents the limitations of conventional reliability design methodologies and justifies the FTOC paradigm. Section 4 details the major design components of FTOC, followed by the evaluation results in Section 5. Section 6 discusses three far-reaching implications of the FTOC paradigm. Section 7 concludes the paper.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1007/s11432-017-9290-4">doi:10.1007/s11432-017-9290-4</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/3mwg4l5pyrashe6het5rlsnlue">fatcat:3mwg4l5pyrashe6het5rlsnlue</a> </span>