Timely Error Detection for Effective Recovery in Light-Lockstep Automotive Systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Safety-relevant systems in the automotive domain often implement features such as lockstep execution for error detection, and reset and re-execution for error correction. Light-lockstep has already been adopted in some such systems due to its relatively low implementation cost, since it does not require deep changes to non-lockstep hardware. Instead, as only off-core activities (i.e., data and addresses sent) need to be compared across different cores, light-lockstep designs are minimally intrusive.
This approach has been proven sufficient to guarantee functional correctness of the system in the presence of errors in the cores, in particular in relation to certification against safety standards such as ISO 26262 in the automotive domain. However, error detection in light-lockstep systems may occur long after the error actually occurs, thus jeopardising timing guarantees, which are as critical as functional ones in hard real-time systems. In this paper we analyse the timing behaviour of errors due to transient and permanent faults in light-lockstep systems. Our results show that the time elapsed until an error is detected can be inordinately large, especially for permanent faults. Based on this observation and building upon the specific characteristics of light-lockstep systems, we propose LiVe (Lightly Verbose), a new mechanism to enforce the early detection of errors due to both transient and permanent faults, thus enabling the computation of tight error detection timing bounds. We also analyse how existing mechanisms for error recovery in multicore systems become more effective when light-lockstep operates in LiVe mode in the context of mixed-criticality workloads.
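The gap between when an error occurs and when it is detected can be illustrated with a toy model. The sketch below is not the mechanism evaluated in the paper: the names `OffCoreEvent`, `first_mismatch` and `detection_latency`, the trace values, and the simplification that both cores issue transactions at the same cycle are all assumptions made for illustration. It only shows that, when just off-core traffic is compared, a corrupted internal value stays invisible until it reaches an off-core transaction.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OffCoreEvent:
    """Hypothetical record of one off-core transaction (illustrative only)."""
    cycle: int    # cycle at which the core issues the transaction
    address: int  # address sent off-core
    data: int     # data sent off-core


def first_mismatch(core_a: Sequence[OffCoreEvent],
                   core_b: Sequence[OffCoreEvent]) -> Optional[int]:
    """Return the cycle of the first off-core event on which the cores disagree.

    Only addresses and data are compared, mirroring the idea that a
    light-lockstep checker sees off-core activity and nothing else.
    """
    for a, b in zip(core_a, core_b):
        if (a.address, a.data) != (b.address, b.data):
            return a.cycle
    return None


def detection_latency(error_cycle: int,
                      core_a: Sequence[OffCoreEvent],
                      core_b: Sequence[OffCoreEvent]) -> Optional[int]:
    """Cycles between the (assumed known) error and its detection, if any."""
    detected = first_mismatch(core_a, core_b)
    return None if detected is None else detected - error_cycle


# Made-up traces: an error at cycle 20 corrupts internal state in one core,
# but the corrupted value is only sent off-core at cycle 9000.
golden = [OffCoreEvent(10, 0x1000, 7),
          OffCoreEvent(500, 0x1004, 8),
          OffCoreEvent(9000, 0x1008, 9)]
faulty = [OffCoreEvent(10, 0x1000, 7),
          OffCoreEvent(500, 0x1004, 8),
          OffCoreEvent(9000, 0x1008, 13)]

print(detection_latency(20, golden, faulty))  # prints 8980
```

In this made-up trace the transaction at cycle 500 still matches, so the error goes unnoticed for 8980 cycles. Latencies of this kind are what make tight detection bounds hard to compute when only off-core activity is compared.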