The effects of an armor-based sift environment on the performance and dependability of user applications

K. Whisnant, R.K. Iyer, Z.T. Kalbarczyk, P.H. Jones, D.A. Rennels, R. Some
2004 IEEE Transactions on Software Engineering  
Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple  ...  The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications.  ...  The authors would like to thank the reviewers for the insightful comments and constructive suggestions. They also thank S. Chen for help in simulating the SAN model and F. Baker and T.  ... 
doi:10.1109/tse.2004.1274045 fatcat:abfflmgrk5a6tovrl4f3aicy24

A robust and lightweight stable leader election service for dynamic systems

Nicolas Schiper, Sam Toueg
2008 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN)  
Intuitively, distributed applications can use this service to elect and maintain an operational leader for any group of processes which may dynamically change.  ...  By using a stochastic failure detector [5] and a link quality estimator, it provides some degree of QoS control and it adapts to changing network conditions.  ...  The service we propose can be used to elect and maintain a leader among any dynamically changing subset of application processes (called a group) in a system with random process crashes, process recoveries  ... 
doi:10.1109/dsn.2008.4630089 dblp:conf/dsn/SchiperT08 fatcat:rawwo25lovfp5exfnnzsgnqfki

On the Quality of Service of Crash-Recovery Failure Detectors

Tiejun Ma, Jane Hillston, Stuart Anderson
2007 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)  
We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run with mean time to failure (MTTF) approaching to infinity and mean time to recovery (MTTR) approaching to  ...  We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state.  ...  From the descriptions in this section, we have shown that the interaction between the QoS of the Crash-Recovery FDS System Model We consider a distributed system model with two services: one FDS and  ... 
doi:10.1109/dsn.2007.70 dblp:conf/dsn/MaHA07 fatcat:jxp7tvulmfbwxi3nd7bmmuo4ku

Distributed transactions for reliable systems

Alfred Z. Spector, Dean Daniels, Daniel Duchamp, Jeffrey L. Eppinger, Randy Pausch
1985 ACM SIGOPS Operating Systems Review  
Various objects that use the facilities of TABS are exemplified and the performance of the system is discussed in detail.  ...  The paper concludes that the prototype provides useful facilities, and that it would be feasible to build a high performance implementation based on its ideas.  ...  Maxwell Berenson constructed the distributed performance monitoring system that made it possible to get accurate performance measurements of distributed transactions.  ... 
doi:10.1145/323627.323641 fatcat:im54sr7pwzgplg4ebfyggjcvxm

Distributed transactions for reliable systems

Alfred Z. Spector, Dean Daniels, Daniel Duchamp, Jeffrey L. Eppinger, Randy Pausch
1985 Proceedings of the tenth ACM symposium on Operating systems principles - SOSP '85  
Various objects that use the facilities of TABS are exemplified and the performance of the system is discussed in detail.  ...  The paper concludes that the prototype provides useful facilities, and that it would be feasible to build a high performance implementation based on its ideas.  ...  Maxwell Berenson constructed the distributed performance monitoring system that made it possible to get accurate performance measurements of distributed transactions.  ... 
doi:10.1145/323647.323641 dblp:conf/sosp/SpectorDDEP85 fatcat:ipwpgynrmrcv5bhs3t2gh4fqiy

ACM SIGACT news distributed computing column 32

Idit Keidar
2008 ACM SIGACT News  
Thanks to Cyril Gavoille, Phuong Hoai Ha, David Ilcinkas, Hyonho Lee, Yee Jiun Song, Gadi Taubenfeld, and Shmuel Zaks for sharing their excellent photos.  ...  "Optimistic Erasure Coded Distributed Storage" by Dutta et al. is about using erasure codes to implement atomic registers in the asynchronous crash-recovery message passing model. Session 4: Shared memory  ...  In "Dynamic Routing and Location Services in Metrics of Low Doubling Dimension", Konjevod et al. present a dynamic compact routing algorithm that has applications to ad hoc mobile networks and distributed  ... 
doi:10.1145/1466390.1466402 fatcat:ygqihi4p5vfnfnmzufwi2fnyk4

Distributed fault tolerance: lessons from Delta-4

D. Powell
1994 IEEE Micro  
Many persons contributed to the Delta-4 project over its six-year lifetime.  ...  Special and very personal thanks must of course go to David, David, Doug, Gottfried, Marc, Pascal, Paulo, Peter, and Santosh, as well as all my compatriots in the Dependable Computing Group at LA.6  ...  Delta-4 assumes the node to be faulty and passivates it by removing it from the system exactly as if it had crashed.  ... 
doi:10.1109/40.259898 fatcat:6435u4ycijetdhzbpguzsybmja

Fault Tolerance in MapReduce: A Survey [chapter]

Bunjamin Memishi, Shadi Ibrahim, María S. Pérez, Gabriel Antoniu
2016 Computer Communications and Networks  
Acknowledgments The research leading to these results has received funding from the H2020 project reference number 642963 in the call H2020-MSCA-ITN-2014.  ...  [34] have derived a stochastic model that in some way predicts the performance of MapReduce applications under failures (crash failures).  ...  Each standby/slave node is registered to active/primary node and its initial metadata (such as version file and file system image) are caught up with those of active/primary node. • Replication phase.  ... 
doi:10.1007/978-3-319-44881-7_11 dblp:series/ccn/MemishiIPA16 fatcat:m5x33gpzunhzzgrdslagndiwzy

Exploring versioned distributed arrays for resilience in scientific applications

A Chien, P Balaji, N Dun, A Fang, H Fujita, K Iskra, Z Rubenstein, Z Zheng, J Hammond, I Laguna, D Richards, A Dubey (+5 others)
2016 The international journal of high performance computing applications  
This control is portable, and its embedding in application source makes it natural to express and easy to maintain.  ...  GVR's multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient.  ...  As shown on the right of Figure 18 the application programmer would typically define an error recovery function to handle a class of errors and register it.  ... 
doi:10.1177/1094342016664796 fatcat:aaipn5vawrg4dhzka4rigj325y

A Fault-tolerance Linguistic Structure for Distributed Applications [article]

Vincenzo De Florio
2016 arXiv pre-print
to address error recovery and reconfiguration.  ...  other and with respect to the above mentioned need.  ...  Lauwereins, whose availability, guidance, and support allowed me to develop my doctoral studies. Many thanks also for his important remarks on the contents of this thesis.  ... 
arXiv:1611.01690v1 fatcat:mtyx6ubjafhzter5zream33d5u

Addressing failures in exascale computing

Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus (+16 others)
2014 The international journal of high performance computing applications  
Two key technologies are needed for this approach to be feasible. (1) low SDC frequency (same as now) and low frequency of system failures or an order of magnitude improvement in system recovery time.  ...  recovery from global system failures. • Silent data corruptions may become too frequent, and errors will not be detected in time. • The output of the application may be used in real time.  ...  Department of Energy for its financial support of ICiS; the ICiS director and steering committee for the support provided to our workshop; and, in particular, Cheryl Zidel for her outstanding administrative  ... 
doi:10.1177/1094342014522573 fatcat:menonpmgdfflzamz2fsivevxqm

Rolex: Resilience-Oriented Language Extensions for Extreme-Scale Systems [article]

Saurabh Hukerikar, Robert F. Lucas
2016 arXiv pre-print
Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge but lack mechanisms to convey it to the system.  ...  The mean time to failure (MTTF) of the system scales inversely to the number of components in the system and therefore faults and resultant system level failures will increase, as systems scale in terms  ...  pointer reference to a user-defined recovery function, which is registered with the runtime system when the memory block is allocated.  ... 
arXiv:1605.01994v2 fatcat:5sjyp32rsjbtvobybzei7ro22y

A survey of cross-layer power-reliability tradeoffs in multi and many core systems-on-chip

Ahmed A. Eltawil, Michael Engel, Bibiche Geuskens, Amin Khajeh Djahromi, Fadi J. Kurdahi, Peter Marwedel, Smail Niar, Mazen A.R. Saghir
2013 Microprocessors and microsystems  
In order to do so, it is crucial to understand the close interplay between the different layers of a system: technology, platform, and application.  ...  As systems-on-chip increase in complexity, the underlying technology presents us with significant challenges due to increased power consumption as well as decreased reliability.  ...  While this is true in case of static manufacturing faults, this model cannot be sustained as scaling progresses due to the random nature of the fluctuation of dopant atom distributions.  ... 
doi:10.1016/j.micpro.2013.07.008 fatcat:bl2v6dfvxnfxnble4pkcg2pcw4

IBM Deep Learning Service [article]

Bishwaranjan Bhattacharjee, Scott Boag, Chandani Doshi, Parijat Dube, Ben Herta, Vatche Ishakian, K. R. Jayaram, Rania Khalaf, Avesh Krishna, Yu Bo Li, Vinod Muthusamy, Ruchir Puri, Yufei Ren (+5 others)
2017 arXiv pre-print
These two trends: deep learning, and "as-a-service" are colliding to give rise to a new business model for cognitive application delivery: deep learning as a service in the cloud.  ...  The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes.  ...  Zookeeper itself is replicated (3-way), and updates to its state are Atomic, strongly Consistent, Isolated and Durable (ACID) due to the use of Zookeeper atomic broadcast.  ... 
arXiv:1709.05871v1 fatcat:wgifwcuqjfghxj7ounvapkh3du

RAID: high-performance, reliable secondary storage

Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson
1994 ACM Computing Surveys  
It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency.  ...  This paper gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work.  ...  If writes do not have to be atomic, applications cannot assume either that the write during a system crash completed or did not complete, and thus it is generally permissible for the bit-interleaved disk  ... 
doi:10.1145/176979.176981 fatcat:5hb5vw56xrh2zo5cifiihoxlre