
BlueGene/L applications: Parallelism On a Massive Scale

Bronis R. de Supinski, Martin Schulz, Vasily V. Bulatov, William Cabot, Bor Chan, Andrew W. Cook, Erik W. Draeger, James N. Glosli, Jeffrey A. Greenough, Keith Henderson, Alison Kubota, Steve Louis (+30 others)
2008 The international journal of high performance computing applications  
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale, with 131,072 processors  ...  BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions.  ...  for parallel programming; and fault tolerance at the application and system level.  ... 
doi:10.1177/1094342007085025 fatcat:s5h4ai3mvvciploic4ljt7n434

BlueGene/L Failure Analysis and Prediction Models

Yinglung Liang, Yanyong Zhang, A. Sivasubramaniam, M. Jette, R. Sahoo
International Conference on Dependable Systems and Networks (DSN'06)  
In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days.  ...  The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K  ...  as IBM BlueGene/L.  ... 
doi:10.1109/dsn.2006.18 dblp:conf/dsn/LiangZSJS06 fatcat:rctxd37jkfb6fiauia3vxffssi

Filtering Failure Logs for a BlueGene/L Prototype

Yinglung Liang, Yanyong Zhang, A. Sivasubramaniam, R.K. Sahoo, J. Moreira, M. Gupta
2005 International Conference on Dependable Systems and Networks (DSN'05)  
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors  ...  In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list.  ...  Although fault-aware design has gained importance for uniprocessor and small-scale systems, the problem escalates to a much higher magnitude when we move to large-scale parallel systems.  ... 
doi:10.1109/dsn.2005.50 dblp:conf/dsn/LiangZSSMG05 fatcat:hribyyz6pnh57dhgpoqiewpmcu

A global operating system for HPC clusters

Emiliano Betti, Marco Cesati, Roberto Gioiosa, Francesco Piermaria
2009 IEEE International Conference on Cluster Computing and Workshops  
These nodes run independent operating system kernels, thus synchronization among them is demanded for user mode programs. This means that temporal synchronization of the nodes is a daunting task.  ...  tailored for HPC applications.  ...  BlueGene/L (and later its successor, BlueGene/P) is the first example in this direction: the timers of each node of the supercomputer are continuously synchronized and, as a result, the "time" is the same  ... 
doi:10.1109/clustr.2009.5289191 dblp:conf/cluster/BettiCGP09 fatcat:ncceam2hvvdexbs3z6zc36rvxe

Performance under failures of high-end computing

Ming Wu, Xian-He Sun, Hui Jin
2007 Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07  
To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures.  ...  A better understanding of faults' influence on application performance is necessary to use existing fault tolerant methods wisely.  ...  The potential resource failure probability has been considered in task scheduling in BlueGene/L systems [12]. These works demonstrate the significance of fault-aware task scheduling.  ... 
doi:10.1145/1362622.1362687 dblp:conf/sc/WuSJ07 fatcat:udidiazmc5dc5poiixo4ektqxm

Predictive Reliability and Fault Management in Exascale Systems

Ramon Canal, Carles Hernandez, Rafa Tornero, Alessandro Cilardo, Giuseppe Massari, Federico Reghenzani, William Fornaciari, Marina Zapater, David Atienza, Ariel Oleksiak, Wojciech Piątek, Jaume Abella
2020 ACM Computing Surveys  
Such ideas have been used by [116] to predict failures of IBM's BlueGene/L system.  ...  For instance, in the IBM BlueGene/L supercomputer, a job experiencing two non-fatal events has a higher chance of experiencing a failure (above 5x) than if it experiences only one [116].  ... 
doi:10.1145/3403956 fatcat:77xcpnevmnc5jfpj6ynhwdng3m

MCREngine: A scalable checkpointing system using data-aware aggregation and compression

Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann
2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources.  ...  High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS).  ...  For example, the 100,000 node BlueGene/L system at Lawrence Livermore National Laboratory (LLNL) experiences an L1 cache parity error every 8 hours [1] and a hard failure every 7-10 days.  ... 
doi:10.1109/sc.2012.77 dblp:conf/sc/IslamMBMSE12 fatcat:42zz42q4onbs5l6w4xliop4jeq

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann
2013 Scientific Programming  
As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources.  ...  High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS).  ...  For example, the 100,000 node BlueGene/L system at Lawrence Livermore National Laboratory (LLNL) experiences an L1 cache parity error every 8 hours [1] and a hard failure every 7-10 days.  ... 
doi:10.1155/2013/341672 fatcat:uqbxsvmeubdclezd6lbxd5f63a

High-performance computing systems: Status and outlook

J. J. Dongarra, A. J. van der Steen
2012 Acta Numerica  
In addition, we discuss the requirements for software that can take advantage of existing and future architectures.  ...  This article describes the current state of the art of high-performance computing systems, and attempts to shed light on near-future developments that might prolong the steady growth in speed of such systems  ...  similar to those applied to the PPC 440 in the BlueGene/L.  ... 
doi:10.1017/s0962492912000050 fatcat:n6yodkox5zb6xmlep6gvayud2m

Performance Implications of Failures in Large-Scale Cluster Scheduling [chapter]

Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, Ramendra K. Sahoo
2005 Lecture Notes in Computer Science  
performance for a wide range of scheduling policies.  ...  On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system  ...  At the same time, scheduling can be used to improve the fault-tolerance [1, 27] of a system in three broad ways.  ... 
doi:10.1007/11407522_13 fatcat:zk3xx6rlgvderchqeh6ir34rca

Software challenges in extreme scale systems

Vivek Sarkar, William Harrod, Allan E Snavely
2009 Journal of Physics, Conference Series  
Carlson is a member of the research staff at the IDA Center for Computing Sciences where, since 1990, his focus has been on applications and system tools for large-scale parallel and distributed computers  ...  , for a range of real system applications, from highly scalable deep space exploration to trans-petaflops level supercomputing.  ...  "peak performance" with over 200 Tflop/s sustained performance (56% efficiency) on the LLNL BlueGene/L [9].  ... 
doi:10.1088/1742-6596/180/1/012045 fatcat:iukutry2dvbitfdh6ng7kgz564

Building Fuel Powered Supercomputing Data Center at Low Cost

Yiqing Hua, Chao Li, Weichao Tang, Li Jiang, Xiaoyao Liang
2015 Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15  
Rather than dispatching computing tasks in bulk without considering power system behaviors, μBatch intelligently splits the job queue into small sets and incrementally schedules jobs based on the power ramping  ...  Distributed power generators fed with various economical clean fuels are emerging as promising power supplies for extreme-scale computing systems.  ...  ACKNOWLEDGEMENT We thank the anonymous reviewers for their valuable comments.  ... 
doi:10.1145/2751205.2751215 dblp:conf/ics/HuaLTJL15 fatcat:guwn2xxbmvapnlwenf5mcamwyq

Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing

John Lange, Kevin Pedretti, Trammell Hudson, Peter Dinda, Zheng Cui, Lei Xia, Patrick Bridges, Andy Gocke, Steven Jaconette, Mike Levenhagen, Ron Brightwell
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
This new architecture provides an incremental path for applications to use supercomputers, running specialized lightweight host operating systems, that is not significantly performance-compromised.  ...  Our benchmarks show that Palacios provides near native (within 5%), scalable performance for virtualized environments running important parallel applications.  ...  Other examples of this approach are the efforts to port Linux to the IBM BlueGene/L and BlueGene/P systems [22], [23].  ... 
doi:10.1109/ipdps.2010.5470482 dblp:conf/ipps/LangePHDCXBGJLB10 fatcat:pfcr3drdhzarxkvsuu436r33am

Efficient subtorus processor allocation in a multi-dimensional torus

Weizhen Mao, Jie Chen, W. Watson
2005 Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA'05)  
The simulation results show that our algorithms (especially the backfilling/NEP combination) are capable of producing schedules with system utilization and mean job bounded slowdowns comparable to those  ...  Specifically, our simulation experiments compare four algorithm combinations, FCFS/EP, FCFS/NEP, backfilling/EP, and backfilling/NEP, for two existing multi-dimensional torus connected systems.  ...  /L [11] (also a 3-D torus).  ... 
doi:10.1109/hpcasia.2005.35 fatcat:u2fmauv3pbfrhi676f2ndwi5ti

Toward Exascale Resilience

Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, Marc Snir
2009 The international journal of high performance computing applications  
This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, possibly radically disruptive, to run applications until their normal  ...  From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kinds of faults many times per day.  ...  Some recent work [GCG07] on BlueGene/L suggests that simple dedicated approaches (in terms of time and energy), could solve specific, well understood faults (errors in the L1 cache in the case of BlueGene  ... 
doi:10.1177/1094342009347767 fatcat:s7i4a7aocnckzka4bxsyzbg6qi