The Error Reporting in the ATLAS TDAQ System

Serguei Kolos, Andrei Kazarov, Lykourgos Papaevgeniou
2015 Journal of Physics, Conference Series  
The ATLAS Error Reporting provides a service that allows experts and shift crew to track and address errors relating to the data taking components and applications. This service, called the Error Reporting Service (ERS), gives to software applications the opportunity to collect and send comprehensive data about run-time errors, to a place where it can be intercepted in real-time by any other system component. Other ATLAS online control and monitoring tools use the ERS as one of their main
more » ... to address system problems in a timely manner and to improve the quality of acquired data. The actual destination of the error messages depends solely on the run-time environment, in which the online applications are operating. When an application sends information to ERS, depending on the configuration, it may end up in a local file, a database, distributed middleware which can transport it to an expert system or display it to users. Thanks to the open framework design of ERS, new information destinations can be added at any moment without touching the reporting and receiving applications. The ERS Application Program Interface (API) is provided in three programming languages used in the ATLAS online environment: C++, Java and Python. All APIs use exceptions for error reporting but each of them exploits advanced features of a given language to simplify the end-user program writing. For example, as C++ lacks language support for exceptions, a number of macros have been designed to generate hierarchies of C++ exception classes at compile time. Using this approach a software developer can write a single line of code to generate a boilerplate code for a fully qualified C++ exception class declaration with arbitrary number of parameters and multiple constructors, which encapsulates all relevant static information about the given type of issues. When a corresponding error occurs at run time, the program just need to create an instance of that class passing relevant values to one of the available class constructors and send this instance to ERS. This paper presents the original design solutions exploited for the ERS implementation and describes how it was used during the first ATLAS run period. The cross-system error reporting standardization introduced by ERS was one of the key points for the successful implementation of automated mechanisms for online error recovery.
doi:10.1088/1742-6596/608/1/012004 fatcat:b6gpfuj6b5dxtkltfjrfl4t3ay