A global-state-triggered fault injector for distributed system evaluation

Ramesh Chandra, R.M. Lefever, K.R. Joshi, M. Cukier, W.H. Sanders
2004 IEEE Transactions on Parallel and Distributed Systems  
In Loki, faults are injected based on a partial view of the global state of the system, and a post-runtime analysis is performed to place events and injections into a single global timeline and to discard  ...  The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate  ...  tool has tackled the challenging problem of global-state-based distributed system fault injection by tying together fault injection based on a partial view of the global state, optimistic synchronization  ... 
doi:10.1109/tpds.2004.14 fatcat:r2i4rwst4rh73c4fvecm6ys45m

Experimental Evaluation of the Unavailability Induced by a Group Membership Protocol [chapter]

Kaustubh R. Joshi, Michel Cukier, William H. Sanders
2002 Lecture Notes in Computer Science  
This paper experimentally evaluates the blocking behavior of the group membership protocol of the Ensemble group communication system using a novel global-state-based fault injection technique.  ...  In doing so, we demonstrate how a layered distributed protocol such as the Ensemble group membership protocol can be modeled in terms of a state machine abstraction, and show how the resulting global state  ...  Acknowledgments This material is based on work supported by the National Science Foundation under Grant No. 0086096.  ... 
doi:10.1007/3-540-36080-8_15 fatcat:crihvxdrqzgzjd6i6d5ypohnp4

Dynamic node management and measure estimation in a state-driven fault injector

2000 Conference Proceedings 2000 International Conference on Mathematical Methods in Electromagnetic Theory (Cat. No.00EX413)  
To address these challenges, the Loki fault injector injects faults based on a partial view of the global state of a distributed system, and performs a post-runtime analysis using an off-line clock synchronization  ...  Validation of distributed systems using fault injection is difficult because of their inherent complexity, lack of a global clock, and lack of an easily accessible notion of a global state.  ...  Loki can inject faults in a distributed system based on a partial view of its global state obtained using notifications, and can determine, using a post-runtime analysis, whether each fault was injected  ... 
doi:10.1109/mmet.2000.6241463 fatcat:zogoffwmqfenbcscxyfwrtvj6m

A language-driven tool for fault injection in distributed systems

W. Hoarau, S. Tixeuil
2005 The 6th IEEE/ACM International Workshop on Grid Computing, 2005.  
We also present FCI, the FAIL Cluster Implementation, that consists of a compiler, a runtime library and a middleware platform for software fault injection in distributed applications.  ...  Being able to test the behavior of a distributed program in an environment where we can control the faults (such as the crash of a process) is an important feature that matters in the deployment of reliable  ...  It is based on a partial view of the global state of the distributed system. The faults are injected based on a global state of the system.  ... 
doi:10.1109/grid.2005.1542742 dblp:conf/grid/HoarauT05 fatcat:fd6ppl7hljfnrfqy3qtje7bwam

Fault injection in distributed Java applications

W. Hoarau, S. Tixeuil, F. Vauchelles
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
In this paper, we investigate the possibility of injecting software faults in distributed Java applications. Our scheme extends the FAIL-FCI software.  ...  Being able to test the behaviour of a distributed program in an environment where we can control the faults (such as the crash of a process) is an important feature that matters in the deployment of reliable  ...  LOKI [5] is a fault injector dedicated to distributed systems. It is based on a partial view of the global state of the distributed system.  ... 
doi:10.1109/ipdps.2006.1639507 dblp:conf/ipps/HoarauTV06 fatcat:bo6ercoefzgfvl4zc6lzgmcasy

A Framework for Experimental Validation and Performance Evaluation in Fault Tolerant Distributed System

Hein Meling
2007 2007 IEEE International Parallel and Distributed Processing Symposium  
The framework provides a facility to execute experiments in a configured target system. It is based on injecting faults or other events needed to test the fault handling capability of the system.  ...  In this paper, a framework for experimental validation and performance evaluation of fault management in a fault tolerant distributed system is presented.  ...  Faults are injected based on a partial view of the global state of a system, i.e. faults injected on one node of the system can depend on the state of other nodes.  ... 
doi:10.1109/ipdps.2007.370600 dblp:conf/ipps/Meling07 fatcat:fcyb7vucffavfh75xhc534okmm

An Approach to Experimentally Obtain Service Dependability Characteristics of the Jgroup/ARM System [chapter]

Bjarne E. Helvik, Hein Meling, Alberto Montresor
2005 Lecture Notes in Computer Science  
This paper describes an approach based on stratified sampling combined with fault injections for estimating the dependability attributes of a service deployed using the Jgroup/ARM middleware framework.  ...  Jgroup/ARM is a middleware framework for operating dependable distributed applications based on Java.  ...  The state diagram is not used to control fault injections based on triggers on a subset of the global state space as in [8] ; instead it is only used during offline, a posteriori analysis of fault injection  ... 
doi:10.1007/11408901_13 fatcat:bxpt66ugkbb2blskxe3ufn55cu

Improving Usability of Fault Injection

D. Cotroneo, L. De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
2014 2014 IEEE International Symposium on Software Reliability Engineering Workshops  
The lack of tools that can fit in existing development practices and processes hampers the adoption of Software Fault Injection (SFI) in real-world projects.  ...  This paper presents an ongoing work towards an SFI tool integrated in the Eclipse IDE, and designed for usability.  ...  ACKNOWLEDGMENT This work has been partially supported by the SVEVIA PON Project (PON02 00485 3487758) funded by the Italian Ministry of Education, University and Research.  ... 
doi:10.1109/issrew.2014.37 dblp:conf/issre/CotroneoSILN14 fatcat:knnvmiqhyfhgjixoxtdktuzv7m

Software patterns for fault injection in CPS engineering

Nicolas Navet, Ivan Cibrario Bertolotti, Tingting Hu
2017 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)  
Software fault injection is a powerful technique to evaluate the robustness of an application and guide in the choice of fault-tolerant mechanisms.  ...  Then, illustrating on the domain-specific language CPAL, we present injection patterns that can be embedded in the application code and discuss the types of faults each supports, as well as implementation  ...  Moreover, this way of modeling corresponds to a centralized fault injection mechanism, even when modeling a distributed system, which is close to practice.  ... 
doi:10.1109/etfa.2017.8247701 dblp:conf/etfa/NavetBH17 fatcat:kovjovaafjh6re6n7ofa5at7za

Redundancy Does Not Imply Fault Tolerance

Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
2017 ACM Transactions on Storage  
The behavior of modern distributed storage systems in response to file-system faults is critical and strongly affects cloud-based services.  ...  Because distributed storage systems inherently store redundant copies of data and we inject only one fault at a time, these behaviors are surprising and undesirable.  ...  We thank the members of the ADSL and the developers of CockroachDB, LogCabin, Redis, RethinkDB, and ZooKeeper for their valuable discussions.  ... 
doi:10.1145/3125497 fatcat:cfqbs2zy6zagrnrfgmzhfevgyu

A Distributed Approach to Autonomous Fault Treatment in Spread

Hein Meling, Joakim L. Gilje
2008 2008 Seventh European Dependable Computing Conference  
The objective of DARM is to improve the dependability characteristics of systems through a fault treatment mechanism.  ...  This paper presents the design and implementation of the Distributed Autonomous Replication Management (DARM) framework built on top of the Spread group communication system.  ...  Acknowledgements This work was partially supported by a scholarship from Telenor iLabs. The authors wish to thank B. Helvik and the anonymous reviewers for useful comments on this paper.  ... 
doi:10.1109/edcc-7.2008.12 dblp:conf/edcc/MelingG08 fatcat:ek5rfjarhbhibeo7wljpqagrsi

Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference

Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, Todd Gamblin
2015 IEEE Transactions on Parallel and Distributed Systems  
We extensively evaluate fault coverage of our technique via fault injections in ten HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.  ...  This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated.  ...  ACKNOWLEDGMENTS The authors thank David Richards of the Lawrence Livermore National Laboratory for helping us to conduct the blind study on ddcMD and to validate the results.  ... 
doi:10.1109/tpds.2014.2314100 fatcat:zgqp4z2gvngongmoslwc6fjqpu

Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions

Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
2017 USENIX Conference on File and Storage Technologies  
Our results have implications for the design of next generation fault-tolerant distributed and cloud storage systems.  ...  We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and may not reflect the views of NSF, DOE, or other institutions.  ... 
dblp:conf/fast/GanesanAAA17 fatcat:m7fl2qfya5h2jjrd7dkb6htkty

Jgroup/ARM: a distributed object group platform with autonomous replication management

Hein Meling, Alberto Montresor, Bjarne E. Helvik, Ozalp Babaoglu
2008 Software, Practice & Experience  
A state merging service (SMS) is provided to simplify the reestablishment of a consistent global state when partitions merge.  ...  Finally, the task of SMS is to support developers in re-establishing a consistent global state when two or more partitions merge by handling state diffusion to other partitions.  ...  ACKNOWLEDGEMENTS The authors wish to thank Heine Kolltveit and Rohnny Moland for commenting on the discussion of replicated transactions.  ... 
doi:10.1002/spe.853 fatcat:zakxyjznwrbnln2zkqagsku7hu

Model-based testing of global properties on large-scale distributed systems

Gerson Sunyé, Eduardo Cunha de Almeida, Yves Le Traon, Benoit Baudry, Jean-Marc Jézéquel
2014 Information and Software Technology  
The defect would not be detected without a global view of the system.  ...  A key concern of any large-scale distributed system is the validation of global properties, which cannot be evaluated on a single node.  ...  Routing tables In a large-scale distributed system, nodes have a partial view of the system, i.e., their routing tables keep only a subset of other node addresses.  ... 
doi:10.1016/j.infsof.2014.02.002 fatcat:u7izxwsd3vhzvcylzwt7tgyu7y