A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017; you can also visit the original URL.
BlueGene/L applications: Parallelism On a Massive Scale
2008
The international journal of high performance computing applications
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale, with 131,072 processors ...
BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. ...
for parallel programming; and fault tolerance at the application and system level. ...
doi:10.1177/1094342007085025
fatcat:s5h4ai3mvvciploic4ljt7n434
BlueGene/L Failure Analysis and Prediction Models
International Conference on Dependable Systems and Networks (DSN'06)
In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. ...
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K ...
as IBM BlueGene/L. ...
doi:10.1109/dsn.2006.18
dblp:conf/dsn/LiangZSJS06
fatcat:rctxd37jkfb6fiauia3vxffssi
Filtering Failure Logs for a BlueGene/L Prototype
2005 International Conference on Dependable Systems and Networks (DSN'05)
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors ...
In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. ...
Although fault-aware design has gained importance for uniprocessor and small-scale systems, the problem escalates to a much higher magnitude when we move to large-scale parallel systems. ...
doi:10.1109/dsn.2005.50
dblp:conf/dsn/LiangZSSMG05
fatcat:hribyyz6pnh57dhgpoqiewpmcu
A global operating system for HPC clusters
2009
2009 IEEE International Conference on Cluster Computing and Workshops
These nodes run independent operating system kernels, so synchronization among them must be provided for user-mode programs. This means that temporal synchronization of the nodes is a daunting task. ...
tailored for HPC applications. ...
BlueGene/L (and later its successor, BlueGene/P) is the first example in this direction: the timers of each node of the supercomputer are continuously synchronized and, as a result, the "time" is the same ...
doi:10.1109/clustr.2009.5289191
dblp:conf/cluster/BettiCGP09
fatcat:ncceam2hvvdexbs3z6zc36rvxe
Performance under failures of high-end computing
2007
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures. ...
A better understanding of faults' influence on application performance is necessary to use existing fault tolerant methods wisely. ...
The potential resource failure probability has been considered in task scheduling in BlueGene/L systems [12]. These works demonstrate the significance of fault-aware task scheduling. ...
doi:10.1145/1362622.1362687
dblp:conf/sc/WuSJ07
fatcat:udidiazmc5dc5poiixo4ektqxm
Predictive Reliability and Fault Management in Exascale Systems
2020
ACM Computing Surveys
Such ideas have been used by [116] to predict failures of the IBM's BlueGene/L system. ...
For instance, in the IBM BlueGene/L supercomputer, a job experiencing two non-fatal events has a much higher chance (above 5x) of experiencing a failure than if it experiences only one [116]. ...
doi:10.1145/3403956
fatcat:77xcpnevmnc5jfpj6ynhwdng3m
MCREngine: A scalable checkpointing system using data-aware aggregation and compression
2012
2012 International Conference for High Performance Computing, Networking, Storage and Analysis
As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. ...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). ...
For example, the 100,000 node BlueGene/L system at Lawrence Livermore National Laboratory (LLNL) experiences an L1 cache parity error every 8 hours [1] and a hard failure every 7-10 days. ...
doi:10.1109/sc.2012.77
dblp:conf/sc/IslamMBMSE12
fatcat:42zz42q4onbs5l6w4xliop4jeq
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
2013
Scientific Programming
As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. ...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). ...
For example, the 100,000 node BlueGene/L system at Lawrence Livermore National Laboratory (LLNL) experiences an L1 cache parity error every 8 hours [1] and a hard failure every 7-10 days. ...
doi:10.1155/2013/341672
fatcat:uqbxsvmeubdclezd6lbxd5f63a
High-performance computing systems: Status and outlook
2012
Acta Numerica
In addition, we discuss the requirements for software that can take advantage of existing and future architectures. ...
This article describes the current state of the art of high-performance computing systems, and attempts to shed light on near-future developments that might prolong the steady growth in speed of such systems ...
similar to those applied to the PPC 440 in the BlueGene/L. ...
doi:10.1017/s0962492912000050
fatcat:n6yodkox5zb6xmlep6gvayud2m
Performance Implications of Failures in Large-Scale Cluster Scheduling
[chapter]
2005
Lecture Notes in Computer Science
performance for a wide range of scheduling policies. ...
On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system ...
At the same time, scheduling can be used to improve the fault-tolerance [1, 27] of a system in three broad ways. ...
doi:10.1007/11407522_13
fatcat:zk3xx6rlgvderchqeh6ir34rca
Software challenges in extreme scale systems
2009
Journal of Physics, Conference Series
Carlson is a member of the research staff at the IDA Center for Computing Sciences where, since 1990, his focus has been on applications and system tools for large-scale parallel and distributed computers, for a range of real system applications, from highly scalable deep space exploration to trans-petaflops level supercomputing. ...
"peak performance" with over 200 Tflop/s sustained performance (56% efficiency) on the LLNL BlueGene/L [9]. ...
doi:10.1088/1742-6596/180/1/012045
fatcat:iukutry2dvbitfdh6ng7kgz564
Building Fuel Powered Supercomputing Data Center at Low Cost
2015
Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
Rather than dispatching computing tasks in bulk without considering power-system behavior, μBatch intelligently splits the job queue into small sets and incrementally schedules jobs based on the power ramping ...
Distributed power generation fed with various economical clean fuels is emerging as a promising power supply for extreme-scale computing systems. ...
doi:10.1145/2751205.2751215
dblp:conf/ics/HuaLTJL15
fatcat:guwn2xxbmvapnlwenf5mcamwyq
Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
This new architecture provides an incremental path for applications to use supercomputers, running specialized lightweight host operating systems, that is not significantly performance-compromised. ...
Our benchmarks show that Palacios provides near native (within 5%), scalable performance for virtualized environments running important parallel applications. ...
Other examples of this approach are the efforts to port Linux to the IBM BlueGene/L and BlueGene/P systems [22], [23]. ...
doi:10.1109/ipdps.2010.5470482
dblp:conf/ipps/LangePHDCXBGJLB10
fatcat:pfcr3drdhzarxkvsuu436r33am
Efficient subtorus processor allocation in a multi-dimensional torus
2005
Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA'05)
The simulation results show that our algorithms (especially the backfilling/NEP combination) are capable of producing schedules with system utilization and mean job bounded slowdowns comparable to those ...
Specifically, our simulation experiments compare four algorithm combinations, FCFS/EP, FCFS/NEP, backfilling/EP, and backfilling/NEP, for two existing multi-dimensional torus connected systems. ...
BlueGene/L [11] (also a 3-D torus). ...
doi:10.1109/hpcasia.2005.35
fatcat:u2fmauv3pbfrhi676f2ndwi5ti
Toward Exascale Resilience
2009
The international journal of high performance computing applications
This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, possibly radically disruptive, to run applications until their normal ...
From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kinds of faults many times per day. ...
Some recent work [GCG07] on BlueGene/L suggests that simple dedicated approaches (in terms of time and energy) could solve specific, well-understood faults (errors in the L1 cache in the case of BlueGene ...
doi:10.1177/1094342009347767
fatcat:s7i4a7aocnckzka4bxsyzbg6qi
Showing results 1 — 15 out of 41 results