Fault Tolerant Electronic System Design
Boyang Du, Luca Sterpone
2016
Technology scaling, which brings smaller transistors, higher density, lower supply voltage and more aggressive clock frequencies, may make VLSI devices more sensitive to soft errors. Especially for devices used in safety- and mission-critical applications, dependability and reliability are becoming increasingly important constraints during the development of the systems built on or around them. Other phenomena (e.g., aging and wear-out effects) also have a negative impact on the reliability of modern circuits. Recent research shows that even at sea level, radiation particles can still induce soft errors in electronic systems.

Online error detection and board-level functional test in processor-based systems

Processor-based systems are commonly used in a wide variety of applications, including safety-critical and high-availability missions, e.g., in the automotive, biomedical and aerospace domains. In these fields, an error may produce catastrophic consequences. Thus, dependability is a primary target that must be achieved while taking into account tight constraints in terms of cost, performance, power and time to market. Several solutions exist, acting on either hardware or software; however, they all have to face the high effort required for designing, manufacturing, testing and qualifying processor-based systems, while standards and regulations (e.g., ISO-26262, DO-254, IEC-61508) clearly specify the targets to be achieved and the methods to prove their achievement. In this scenario, techniques working at system level (i.e., without changing the technology or the processor) are particularly attractive, especially if they can meet dependability needs efficiently without changes to the existing hardware and software.

Approaches to detecting soft errors in processor-based systems are traditionally divided into techniques that deal with faults affecting the data and techniques that deal with faults affecting the execution flow of the software application. To detect and possibly correct faults affecting the data, i.e., Data Errors, either detection and correction strategies can be applied to the data memory itself, such as Error Correction Coding (ECC), or the software or hardware (or both) must be modified to introduce redundancy, for example by duplicating variables and adding instructions that check the data in software.
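The variable duplication with checking instructions mentioned above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function `checked_sum`, the exception `DataError` and the `inject_at` fault-injection parameter are hypothetical names introduced only for illustration.

```python
class DataError(Exception):
    """Raised when the two copies of a duplicated variable disagree."""


def checked_sum(values, inject_at=None):
    """Sum `values` while keeping a duplicated shadow copy of the accumulator.

    `inject_at` is a hypothetical test hook: at that iteration a single bit
    of one copy is flipped to emulate a soft error.
    """
    total_a = 0  # original variable
    total_b = 0  # duplicated (shadow) variable
    for i, v in enumerate(values):
        total_a += v  # original instruction
        total_b += v  # duplicated instruction on the shadow copy
        if i == inject_at:
            total_a ^= 1 << 3  # emulated single bit flip in one copy
        # Checking instruction: a fault that hit only one copy is visible here.
        if total_a != total_b:
            raise DataError("copies of the accumulator diverged")
    return total_a
```

Calling `checked_sum([1, 2, 3])` returns the plain sum, while `checked_sum([1, 2, 3], inject_at=1)` raises `DataError`. The sketch only detects the error; correcting it would need additional redundancy or a recovery mechanism.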
Some faults affecting the execution flow overlap with those affecting the data, for example faults corrupting a variable used in a branch instruction; the rest are difficult to handle, such as faults affecting the registers used in the processor pipeline. To mitigate soft errors affecting the execution flow, i.e., Control Flow Errors (CFEs), traditional Triple Modular Redundancy (TMR) can be an effective solution when applied at the gate level of the processor, provided the netlist of the target processor is available. This is usually not the case when a Commercial Off-The-Shelf (COTS) component is used, let alone the cost of verifying compliance with the standards and regulations mentioned above, since the processor's hardware is modified. To avoid the huge hardware overhead of TMR applied at system level (>200%), solutions have been proposed that first detect the CFE, either with extra instructions inserted into the software or with an extra component monitoring the processor (e.g., a watchdog processor), and then correct it, either by simply resetting the processor or by relying on further software techniques such as checkpoint rollback, depending on the nature and requirements of the processor's workload.

The first part of this work mainly focuses on online test for detecting CFEs; a hybrid solution is discussed afterwards. Many processors (standalone or cores) already include a debug interface that assists designers in debugging hardware and software at different stages and can provide information about the software running on the processor in a non-intrusive way (e.g., the debug interface in the LEON3 processor). An external hardware module, namely the CFC module, was therefore proposed; it is attached to the processor through the debug/trace interface to extract this information and monitor the execution of the software on the processor.
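As background for the TMR scheme discussed above, the following Python sketch shows the 2-of-3 majority voting on which TMR relies. It is a software illustration of what would be a hardware voter, not the thesis's implementation; `majority_vote`, `tmr`, `faulty_replica` and `flip_mask` are hypothetical names.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three replica outputs (integers)."""
    return (a & b) | (a & c) | (b & c)


def tmr(compute, x, faulty_replica=None, flip_mask=0):
    """Run `compute(x)` three times and vote on the results.

    `faulty_replica` and `flip_mask` are hypothetical test hooks that
    corrupt the output of one replica to emulate a soft error.
    """
    outputs = [compute(x) for _ in range(3)]
    if faulty_replica is not None:
        outputs[faulty_replica] ^= flip_mask  # emulated bit flips in one replica
    return majority_vote(*outputs)
```

With one corrupted replica the voter still returns the correct value, which is why TMR masks single faults; the price is three copies of the logic plus the voter, consistent with the overhead above 200% quoted in the text.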
With a debug interface such as the one in LEON3, the CFC module is able to extract the executed instructions and the corresponding Program Counter (PC) values. The software running on the processor can be divided into Basic Blocks (BBs), in which all instructions are executed sequentially without branch or jump instructions. The main idea behind the CFC module is to calculate the signature of each BB executed by the processor and compare it with the signature previously stored in a table inside the CFC module, namely the CFC Signature Table. The CFC module is much smaller than the processor itself in terms of area. Data from simulation-based fault injection campaigns on both the LEON3 and miniMIPS processors with several benchmark applications showed the proposed CFC module to be a non-intrusive, effective way of detecting CFEs without modifying the software or the processor implementation.

As the CFC module focuses only on CFEs, a hybrid technique was proposed that combines dual control flow monitoring for detecting soft errors with a software-based technique targeting Data Errors. The hybrid technique includes an external Hardware Monitor (HM) that also attaches to the debug/trace interface of the processor to extract the same information as the CFC module. In addition, the HM monitors the communication between the processor and the memory on the system bus. By extracting the read addresses the processor sends to the memory and the data it retrieves, the HM obtains the input stream of instructions fetched by the processor, while the information from the debug/trace interface provides the output stream of instructions executed by the processor.
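The per-basic-block signature check performed by the CFC module, described earlier in this passage, can be sketched in Python as follows. The abstract does not specify the signature function or the table layout, so several details here are assumptions: CRC-32 is used as a stand-in signature, instructions are taken to be 32-bit words at a 4-byte stride, and the names `signature`, `build_signature_table` and `check_trace` are hypothetical.

```python
import zlib


def signature(words):
    """CRC-32 over 32-bit instruction words (assumed signature function)."""
    data = b"".join(w.to_bytes(4, "big") for w in words)
    return zlib.crc32(data)


def build_signature_table(basic_blocks):
    """Build {start_pc: (length, signature)} from blocks extracted offline.

    `basic_blocks` maps the start PC of each BB to its instruction words.
    """
    return {pc: (len(ws), signature(ws)) for pc, ws in basic_blocks.items()}


def check_trace(trace, table):
    """Check a list of (pc, word) pairs observed on the trace interface.

    Returns the trace index at which the offending block starts,
    or None if no control flow error is detected.
    """
    i = 0
    while i < len(trace):
        start_pc = trace[i][0]
        if start_pc not in table:
            return i  # execution landed somewhere that is not a BB start
        length, expected = table[start_pc]
        block = trace[i:i + length]
        # Inside a BB the PC must advance sequentially (assumed 4-byte stride).
        if any(pc != start_pc + 4 * k for k, (pc, _) in enumerate(block)):
            return i
        if len(block) < length:
            return None  # trace ended inside a block; nothing more to judge
        if signature([w for _, w in block]) != expected:
            return i  # executed instructions differ from the stored signature
        i += length
    return None
```

For a table built from two blocks starting at `0x100` and `0x200`, a trace that runs both blocks in order passes, whereas a trace that jumps out of the first block early, enters a block in the middle, or contains a corrupted instruction word is flagged at the index of the affected block.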
Inside the HM, the input and output instruction streams are carefully synchronized and compared to detect the occurrence of CFEs; part of the Data Errors is also covered in this way. To achieve full coverage including the Data Errors, a software-based technique combining "Dataflow duplication" and "inverted branches" is applied. A fault injection campaign emulating the effects of Single Event Upsets and Single Event Transients was carried out, and the results confirmed the high fault coverage the hybrid technique can achieve with a small hardware overhead.

To further exploit the existing debug interface in the processor for testing purposes, Printed Circuit Board Assembly (PCBA) Power-On Self-Test (POST) was investigated in search of a feasible solution to increase the processor's observability. POST plays an important role in many systems, since it may detect faults arising during the lifetime of the product, thus increasing its dependability. POST may use different solutions, which should match the constraints of the environment the system is deployed in.
doi:10.6092/polito/porto/2644047
fatcat:w6pgvuqls5e2fllw6qccg4ibou