Autonomous, failure-resilient orchestration of distributed discrete event simulations

Matthew Malensek, Zhiquan Sui, Neil Harvey, Shrideep Pallickara
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference (CAC '13), 2013
Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
doi:10.1145/2494621.2494625 dblp:conf/cac/MalensekSHP13 fatcat:gpiqqp76eraava2u7efefzwir4