Carnegie Mellon's CyDAT: Harnessing a Wide Array of Telemetry Data to Enhance Distributed System Diagnostics

Chas DiFatta, Mark Poepping, Daniel V. Klein
2008 USENIX Symposium on Operating Systems Design and Implementation  
The number and complexity of distributed applications have exploded, and to date each has had to create its own method for providing diagnostic tools and performance metrics. These distributed services have become increasingly dependent, not only on the system and network infrastructures upon which they are built, but also on each other. The effectiveness of a diagnostician is seriously hindered by the difficulty in accessing diagnostic data. However, even when access can be gained, it exposes the daunting challenge of correlating a myriad of different data formats and an incredible amount of data (both in static files and real-time streams). To say that diagnosis of distributed systems is complex and difficult is a vast understatement, and the task is getting tougher every day. There is a paucity of tools, data mining methods, and logfile standards that has been worsening for years. Researchers face the same difficulty in gaining access to data for purposes of experimentation. Responding to these difficulties, we've established the CyDAT (Cyber-center for Diagnostics Analytics and Telemetry) effort within CyLAB at Carnegie Mellon, to enable researchers to interact with a rich and varied set of data in an open, multi-vendor environment that enables and supports open, interdisciplinary research. This paper describes the CyDAT and a reference implementation of an event framework (EDDY) to normalize, transform, and transport telemetry data to the analytics that need them, providing a means for tackling the diagnostic Hydra.
dblp:conf/osdi/DiFattaPK08 fatcat:mvnhvm2bwjdx5kuwu2v3fkgug4