Machine learning based job status prediction in scientific clusters

Wucherl Yoo, Alex Sim, Kesheng Wu
<span title="">2016</span> <i title="IEEE"> 2016 SAI Computing Conference (SAI) </i> &nbsp;
This prediction accuracy can be sufficiently high that it can be used for mitigation procedures of predicted failures.  ...  To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters.  ...  Failures are costly to users and systems since they waste time and system resources. Online failure prediction can mitigate this waste by taking early actions for those predicted failures.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/sai.2016.7555961">doi:10.1109/sai.2016.7555961</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/57nhkjow4zba5ex62cjri6e3x4">fatcat:57nhkjow4zba5ex62cjri6e3x4</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20201107203133/https://escholarship.org/content/qt4wx6w700/qt4wx6w700.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/e6/fa/e6fa57cb38f0b31bb86711151fbef5b5ff1f3a77.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/sai.2016.7555961"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Quantifying event correlations for proactive failure management in networked computing systems

Song Fu, Cheng-Zhong Xu
<span title="">2010</span> <i title="Elsevier BV"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/rwjg7tprafhufajayuvxj2q4n4" style="color: black;">Journal of Parallel and Distributed Computing</a> </i> &nbsp;
We cluster failure events based on their correlations and predict their future occurrences.  ...  and capture failure correlations in a cluster coalition environment.  ...  A preliminary version of this paper was presented in the Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS'07) [14] .  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1016/j.jpdc.2010.06.010">doi:10.1016/j.jpdc.2010.06.010</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/wwqpddtmmre7lhzbhttilmp2lm">fatcat:wwqpddtmmre7lhzbhttilmp2lm</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170705063123/http://ece.eng.wayne.edu/~czxu/paper/fu-jpdc2010.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/19/cd/19cdfc2e38fbab789f123e28a504fd90f5cdfba8.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1016/j.jpdc.2010.06.010"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> elsevier.com </button> </a>

A Meta-Learning Failure Predictor for Blue Gene/L Systems

Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev Thakur, John White
<span title="">2007</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/kpclxvevxnc5li5z6asmi6i6ve" style="color: black;">Proceedings of the International Conference on Parallel Processing</a> </i> &nbsp;
Successful prediction of potential failures can greatly enhance various fault tolerance mechanisms used in large clusters, thereby mitigating the adverse impact of failures on system productivity and total  ...  In this paper, we present a three-phase failure predictor to automatically process RAS events and further discover failure patterns for prediction in Blue Gene/L systems.  ...  Further, the proposed meta-learning mechanism should be further examined for advancing failure prediction in large clusters.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/icpp.2007.9">doi:10.1109/icpp.2007.9</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/icpp/GujratiLLTW07.html">dblp:conf/icpp/GujratiLLTW07</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/2lhdv2ztrze3divbk2zjcuteqy">fatcat:2lhdv2ztrze3divbk2zjcuteqy</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170812042357/http://www.cs.iit.edu/~zlan/publications/icpp07_meta.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/8e/c2/8ec27911cc66893c2d2d62938121ad07926c2f6a.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/icpp.2007.9"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

A Virtual Server QoS Enhancement Method in Cloud Computing

Berihun Fekade, Taras Maksymyuk, Minho Jo
<span title="">2016</span> <i title="ACM Press"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/pwky5eqcwvht7daxecjq73vuly" style="color: black;">Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication - IMCOM &#39;16</a> </i> &nbsp;
In this paper, we propose a failure prediction and prevention model of hypervisors by using a naive Bayes classifier.  ...  In such conditions, system reliability becomes a challenging task for cloud computing.  ...  We group hypervisors for a fail-safe mechanism before we apply the Bayesian classifier. Our proposed system is based on grouping hypervisors for a fail-safe mechanism.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/2857546.2857629">doi:10.1145/2857546.2857629</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/icuimc/FekadeMJ16.html">dblp:conf/icuimc/FekadeMJ16</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/xxxxst3eezesngrzpvv2dav3d4">fatcat:xxxxst3eezesngrzpvv2dav3d4</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20200321092121/http://iot.korea.ac.kr/Member/prof_jo/A%20Virtual%20Server%20QoS%20Enhancement%20Method%20in%20Cloud%20Computing.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/e6/fc/e6fc9a1d3b57aa574c1a2d47f9371919f556b18b.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/2857546.2857629"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> acm.org </button> </a>

Self-organization of dragon king failures

Yuansheng Lin, Keith Burghardt, Martin Rohden, Pierre-André Noël, Raissa M. D'Souza
<span title="2018-08-27">2018</span> <i title="American Physical Society (APS)"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/5pgnwkvlevhybgsbuc3q7fchx4" style="color: black;">Physical review. E</a> </i> &nbsp;
In our model, we find that once an initial failure size is above a critical value, the Dragon King mechanism kicks in, leading to piggybacking system-wide failures.  ...  In contrast, if strong nodes fail once a sufficient fraction of their neighbors fail, this leads to "Dragon Kings", which are massive failures caused by mechanisms distinct from smaller failures.  ...  as small events (i.e., the mechanism being the failure of a single cluster of weak nodes in the IN model.)  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1103/physreve.98.022127">doi:10.1103/physreve.98.022127</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/ivn2dqkfsvazje3hvorxk2szou">fatcat:ivn2dqkfsvazje3hvorxk2szou</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20200307191435/https://link.aps.org/accepted/10.1103/PhysRevE.98.022127" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/30/2b/302bddab895bb9b8b6a0dfa19a902718304724a3.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1103/physreve.98.022127"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> aps.org </button> </a>

An online failure prediction system for private IaaS platforms

Pedro Capelastegui, Alvaro Navas, Francisco Huertas, Rodrigo Garcia-Carmona, Juan Carlos Dueñas
<span title="">2013</span> <i title="ACM Press"> Proceedings of the 2nd International Workshop on Dependability Issues in Cloud Computing - DISCCO &#39;13 </i> &nbsp;
In addition, this system operates at both the physical and virtual planes of the cloud, taking into account the relationships between nodes and failure propagation mechanisms that are unique to cloud environments  ...  A more proactive approach is provided by online failure prediction (OFP) techniques.  ...  prediction mechanisms, based on the choice of input data: monitoring based prediction, which periodically examines monitored system  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/2506155.2506159">doi:10.1145/2506155.2506159</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/bjut5ee3prgqtchdditwlc4u5i">fatcat:bjut5ee3prgqtchdditwlc4u5i</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170705081228/http://oa.upm.es/25767/1/INVE_MEM_2013_160484.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/c4/c2/c4c21675b9a21e475628a554e38b9e6952586768.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/2506155.2506159"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> acm.org </button> </a>

Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

Song Fu
<span title="">2009</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/ujjptpi7mjgmfdad3mk3fcer3y" style="color: black;">2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid</a> </i> &nbsp;
In large-scale clusters and computational grids, component failures become norms instead of exceptions.  ...  In this paper, we study how to efficiently utilize system resources for high-availability clusters with the support of the virtual machine (VM) technology.  ...  This research was supported in part by U.S. IAS LANL grant.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ccgrid.2009.21">doi:10.1109/ccgrid.2009.21</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/ccgrid/Fu09.html">dblp:conf/ccgrid/Fu09</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/fgw7yurkljgo7eogwoitwhkwf4">fatcat:fgw7yurkljgo7eogwoitwhkwf4</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170829094726/http://www.cse.unt.edu/~song/Publications/FADVM-ccgrid09.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/ab/17/ab174aae710d05f7d20c7d2b3edf198ead9cf02e.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ccgrid.2009.21"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Topology-aware reliability optimization for multiprocessor systems

Jie Meng, Fulya Kaplan, Mingyu Hsieh, Ayse K. Coskun
<span title="">2012</span> <i title="IEEE"> 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC) </i> &nbsp;
In this work, we propose a topology-aware workload allocation policy to optimize the reliability of multi-chip multicore systems at runtime.  ...  Our results show that the proposed policy improves the system reliability by up to 123.3% compared to existing temperature balancing policies when systems have medium to high utilization.  ...  Failure rates for these three failure mechanisms can be expressed in the following general form: (1) where E_a is the activation energy for the failure mechanism, k is Boltzmann's constant (8.62  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/vlsi-soc.2012.7332108">doi:10.1109/vlsi-soc.2012.7332108</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/e3syinnh55g5pmvhltknyhluhm">fatcat:e3syinnh55g5pmvhltknyhluhm</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170811231609/http://www.bu.edu/peaclab/files/2014/03/meng_VLSISOC12.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/5d/67/5d67557f90275672802a8228834dd8ff3eb7d6e0.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/vlsi-soc.2012.7332108"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Topology-aware reliability optimization for multiprocessor systems

Jie Meng, Fulya Kaplan, Mingyu Hsieh, Ayse K. Coskun
<span title="">2012</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/yspntyokqnedlhxebffqxpdqd4" style="color: black;">2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC)</a> </i> &nbsp;
In this work, we propose a topology-aware workload allocation policy to optimize the reliability of multi-chip multicore systems at runtime.  ...  Our results show that the proposed policy improves the system reliability by up to 123.3% compared to existing temperature balancing policies when systems have medium to high utilization.  ...  Failure rates for these three failure mechanisms can be expressed in the following general form: (1) where E_a is the activation energy for the failure mechanism, k is Boltzmann's constant (8.62  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/vlsi-soc.2012.6379037">doi:10.1109/vlsi-soc.2012.6379037</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/vlsi/MengKHC12.html">dblp:conf/vlsi/MengKHC12</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/swjo3cf7bje3zoc4o72vmh3b2m">fatcat:swjo3cf7bje3zoc4o72vmh3b2m</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170811231609/http://www.bu.edu/peaclab/files/2014/03/meng_VLSISOC12.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/5d/67/5d67557f90275672802a8228834dd8ff3eb7d6e0.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/vlsi-soc.2012.6379037"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management

Song Fu, Cheng-Zhong Xu
<span title="">2007</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/3s3htdviunfcnirqej3sjiioma" style="color: black;">Symposium on Reliable Distributed Systems. Proceedings</a> </i> &nbsp;
capture failure correlations in cluster coalition environment. 26th IEEE International Symposium on Reliable Distributed Systems  ...  We cluster failure events based on their correlations and predict their future occurrences.  ...  We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/srds.2007.4365694">doi:10.1109/srds.2007.4365694</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/5awp4kvtoffynjh4j6n4bby3hm">fatcat:5awp4kvtoffynjh4j6n4bby3hm</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20090530021900/http://www.cs.nmt.edu:80/~song/Publications/FC-srds07.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/e9/26/e926e7ef53b4139275d6298f02b41d4614ab025f.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/srds.2007.4365694"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management

Song Fu, Cheng-Zhong Xu
<span title="">2007</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/vb3ehoarofgm5p24f2dqjxmp3a" style="color: black;">2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)</a> </i> &nbsp;
capture failure correlations in cluster coalition environment. 26th IEEE International Symposium on Reliable Distributed Systems  ...  We cluster failure events based on their correlations and predict their future occurrences.  ...  We would also like to thank Philip Sokolowski and Michael Thompson for their kind help in data collection from the Wayne State Grid. This research was supported in part by U.S.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/srds.2007.18">doi:10.1109/srds.2007.18</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/srds/FuX07.html">dblp:conf/srds/FuX07</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/ktwqpaljsfcwpfyu4ynotns634">fatcat:ktwqpaljsfcwpfyu4ynotns634</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20090530021900/http://www.cs.nmt.edu:80/~song/Publications/FC-srds07.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/e9/26/e926e7ef53b4139275d6298f02b41d4614ab025f.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/srds.2007.18"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Learning Towards Failure Prediction of High Performance Computing Clusters by Employing LSTM

<span title="2019-08-30">2019</span> <i title="Blue Eyes Intelligence Engineering and Sciences Publication - BEIESP"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/h673cvfolnhl3mnbjxkhtxdtg4" style="color: black;">International Journal of Engineering and Advanced Technology</a> </i> &nbsp;
Failure prediction of high-performance computing clusters (HPCC) has been a crucial issue and a hot problem for many years.  ...  We have employed the concept of long short-term memory (LSTM) with reinforcement learning to correct the prediction accuracy in real-time and provide a solution to the industry with reliable results  ...  They tested their system on a Linux cluster and stated 73% accuracy for advance failure prediction.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.35940/ijeat.f7885.088619">doi:10.35940/ijeat.f7885.088619</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/kjgttkraxrelrbvfnjrsg7scnm">fatcat:kjgttkraxrelrbvfnjrsg7scnm</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20200207191640/https://www.ijeat.org/wp-content/uploads/papers/v8i6/F7885088619.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/65/b9/65b936ed0e18c9deb8bb6a24364428dffa721b1c.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.35940/ijeat.f7885.088619"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="unlock alternate icon" style="background-color: #fb971f;"></i> Publisher / doi.org </button> </a>

A Data-Driven Approach to Dynamically Adjust Resource Allocation for Compute Clusters [article]

Francesco Pace, Dimitrios Milios, Damiano Carra, Daniele Venzano and Pietro Michiardi
<span title="2018-07-01">2018</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
In this work, we propose a mechanism that improves cluster utilization, thus decreasing the average turnaround time, while preventing application failures due to contention in accessing finite resources  ...  Thus, tenants enjoy a responsive system and providers benefit from an efficient cluster utilization.  ...  In this paper we present our design of a data-driven resource shaping mechanism that improves cluster utilization, thus decreasing the average turnaround time, while preventing application failures due  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/1807.00368v1">arXiv:1807.00368v1</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7byxnyhnkjatvnx4hio5qvg2e4">fatcat:7byxnyhnkjatvnx4hio5qvg2e4</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20200823033305/https://arxiv.org/pdf/1807.00368v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/37/51/3751387ea9a82cc7867d4bb47cb5a4fb4f1ff233.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/1807.00368v1" title="arxiv.org access"> <button class="ui compact blue labeled icon button serp-button"> <i class="file alternate outline icon"></i> arxiv.org </button> </a>

Exploit failure prediction for adaptive fault-tolerance in cluster computing

Yawei Li, Zhiling Lan
<span title="">2006</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/ujjptpi7mjgmfdad3mk3fcer3y" style="color: black;">Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID&#39;06)</a> </i> &nbsp;
based on the failure prediction.  ...  In this work, we propose FT-Pro, an adaptive fault management mechanism that optimally chooses migration, checkpointing or no action to reduce the application execution time in the presence of failures  ...  We also would like to thank Charng-da Lu at University of Illinois at Urbana-Champaign for the Platinum failure log.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ccgrid.2006.45">doi:10.1109/ccgrid.2006.45</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/ccgrid/LiL06.html">dblp:conf/ccgrid/LiL06</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/mwllmmwnmfgg7hliqdrg3crlcq">fatcat:mwllmmwnmfgg7hliqdrg3crlcq</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20081031190401/http://www.cs.iit.edu/~zlan/publications/ccgrid06_lan.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/c9/f9/c9f9b150d31d169d2eeec2a53cc4d7a198cecbd6.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ccgrid.2006.45"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>

Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI

Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. Panda
<span title="">2012</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/t3x4vqewrncrfgn2wu7cafsbsq" style="color: black;">2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops &amp; PhD Forum</a> </i> &nbsp;
Fault-detection and prediction in HPC clusters and Cloud-computing systems are increasingly challenging issues.  ...  A deployment of FTB-IPMI that services a cluster with 128 compute-nodes, sweeps the entire cluster and collects IPMI sensor information on CPU temperature, system voltages and fan speeds in about 0.75  ...  Mark Arnold, our Systems Manager, for his assistance during the development and testing of FTB-IPMI, the anonymous reviewers for their constructive suggestions, and the members of the CIFTS team for valuable  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ipdpsw.2012.139">doi:10.1109/ipdpsw.2012.139</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/ipps/RajachandrasekarBP12.html">dblp:conf/ipps/RajachandrasekarBP12</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/e5uqu7whnbdtrcw6pej4ftolji">fatcat:e5uqu7whnbdtrcw6pej4ftolji</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20160605092712/http://rajachan.com/papers/smtps12.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/f1/4f/f14f5cd0cd5e5d839752d383aff524d59ef4513b.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/ipdpsw.2012.139"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> ieee.com </button> </a>