
Characterization of Scientific Workloads on Systems with Multi-Core Processors

Sadaf Alam, Richard Barrett, Jeffery Kuehn, Philip Roth, Jeffrey Vetter
2006 2006 IEEE International Symposium on Workload Characterization  
In addition, we evaluated a number of processor affinity techniques for managing memory placement on these multi-core systems.  ...  Multi-core processors are planned for virtually all next-generation HPC systems.  ...  Simply put, the shared memory and I/O (network) bandwidth of multiple cores in a socket draws into question both how efficiently an application can use multiple cores and what methods provide the highest  ... 
doi:10.1109/iiswc.2006.302747 dblp:conf/iiswc/AlamBKRV06 fatcat:fd4kwtxn25fqtpd4rll2cyv4sq
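The processor-affinity techniques evaluated above can be exercised directly from user code. A minimal Linux-only sketch using Python's `os.sched_setaffinity` (which wraps the same `sched_setaffinity(2)` syscall that tools like `taskset` use); the choice of core 0 is illustrative:

```python
import os

# Pin the current process to core 0, then read the mask back.
# Linux-only: os.sched_setaffinity wraps sched_setaffinity(2).
os.sched_setaffinity(0, {0})      # pid 0 means "this process"
mask = os.sched_getaffinity(0)
print(mask)                       # {0}
```

On a NUMA multi-core node, pinning like this (per thread or per rank) is what keeps a core's memory accesses on its local socket.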

Lattice Boltzmann simulation optimization on leading multicore platforms

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick
2008 Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)  
Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.  ...  We present an auto-tuning approach to optimize application performance on emerging multicore architectures.  ...  We would also like to thank George Vahala and his research group for the original version of the LBMHD code.  ... 
doi:10.1109/ipdps.2008.4536295 dblp:conf/ipps/WilliamsCOSY08 fatcat:akmsmuhfjrcv3l2fku7endvt64
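The auto-tuning approach these papers describe reduces, at its core, to generating candidate kernel variants and keeping the empirically fastest one. A toy sketch of that search loop (the lambda "kernel" and candidate values below are illustrative, not taken from the paper):

```python
import time

def autotune(kernel, candidates, reps=3):
    """Return the candidate parameters with the best measured runtime."""
    best_params, best_time = None, float("inf")
    for params in candidates:
        # best-of-reps timing damps scheduler noise
        t = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            kernel(params)
            t = min(t, time.perf_counter() - t0)
        if t < best_time:
            best_params, best_time = params, t
    return best_params

# Toy "kernel" whose cost grows with the parameter value.
best = autotune(lambda n: sum(range(n)), [100_000, 1_000, 10_000])
print(best)  # the cheapest variant wins: 1_000
```

Real auto-tuners search over blocking factors, unrolling depths, and prefetch distances rather than a single scalar, but the measure-and-select loop is the same.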

Optimized FFT computations on heterogeneous platforms with application to the Poisson equation

Jing Wu, Joseph JaJa
2014 Journal of Parallel and Distributed Computing  
Highlights: • New strategy to decompose large multi-dimensional FFTs on CPU-GPU platforms. • Executions of GPU kernels are almost completely overlapped with PCI bus transfers. • Multi-dimensional data is transferred only once between the GPU and CPU. • Scheme is equally effective for single- and double-precision computations.  ...  Abstract: We develop optimized multi-dimensional FFT implementations  ...  Acknowledgments: This work was partially supported by an NSF PetaApps award, grant OCI0904920, the NVIDIA Research Excellence Center at the University of Maryland, and by an NSF Research Infrastructure  ... 
doi:10.1016/j.jpdc.2014.03.009 fatcat:efito37ujzdhzmrteulccqbisa
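As a reminder of why FFTs pair so naturally with the Poisson equation: in Fourier space the Laplacian becomes multiplication by -k², so -u'' = f reduces to a pointwise divide. A single-node NumPy sketch of the 1D periodic case (no CPU-GPU decomposition or transfer overlap, unlike the paper's scheme):

```python
import numpy as np

# Solve -u'' = f on [0, 2*pi) with periodic boundaries via FFT.
n = 64
L = 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
f = np.sin(x)                                # rhs; exact solution is u = sin(x)

k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)   # wavenumbers
fhat = np.fft.fft(f)
k2 = k**2
k2[0] = 1.0                                  # avoid divide-by-zero at k = 0
uhat = fhat / k2
uhat[0] = 0.0                                # pin the undetermined mean to zero
u = np.fft.ifft(uhat).real

err = np.max(np.abs(u - np.sin(x)))          # spectral accuracy: ~machine eps
```

In 3D the divide is by kx² + ky² + kz², and it is exactly this transform-divide-transform structure that the paper decomposes across the CPU and GPU.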

Optimizing UPC Programs for Multi-Core Systems

Yili Zheng
2010 Scientific Programming  
Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.  ...  The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems  ...  Though UPC and other PGAS languages were initially focused on large scale distributed-memory machines, they are also a good fit for emerging multicore systems because the data partitioning capability of  ... 
doi:10.1155/2010/646829 fatcat:q63ngpj47jblhfzbfcdehsmuyi

Roofline: An Insightful Visual Performance Model for Multicore Architectures

Samuel Williams, Andrew Waterman, David Patterson
2009 Communications of the ACM  
We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.  ...  Jae Lee, Rajesh Nishtala, Heidi Pan, David Wessel, Mark Hill and the anonymous reviewers for feedback on early drafts of this paper.  ...  Our thanks go to Joseph Gebis, Leonid Oliker, John Shalf, Katherine Yelick, and the rest of the Par Lab for feedback on the Roofline model, and to Jike Chong, Kaushik Datta, Mark Hoemmen, Matt Johnson,  ... 
doi:10.1145/1498765.1498785 fatcat:t4bx3edd5ba5hbfhd2xrpuo2si
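The Roofline model itself is one line of arithmetic: attainable GFLOP/s = min(peak GFLOP/s, memory bandwidth × arithmetic intensity). A sketch with made-up machine numbers (the peak and bandwidth defaults below are assumptions for illustration, not figures from the paper):

```python
def roofline(ai, peak_gflops=100.0, bw_gbs=25.0):
    """Attainable GFLOP/s at arithmetic intensity `ai` (flops per byte)."""
    return min(peak_gflops, bw_gbs * ai)

# The "ridge point" is peak / bandwidth = 4 flops/byte here:
# below it a kernel is memory-bound, above it compute-bound.
print(roofline(1.0))   # 25.0  -> memory-bound
print(roofline(8.0))   # 100.0 -> compute-bound
```

Plotting this min() on log-log axes against arithmetic intensity gives the characteristic slanted-roof shape the paper is named for.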

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Aparna Chandramowlishwaran, Samuel Williams, Leonid Oliker, Ilya Lashuk, George Biros, Richard Vuduc
2010 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
Victoria Falls (dual-sockets on all systems).  ...  This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems.  ...  Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, DARPA, or Intel.  ... 
doi:10.1109/ipdps.2010.5470415 dblp:conf/ipps/ChandramowlishwaranWOLBV10 fatcat:p7tw54f5fza5ddjtyuwz7pn4gi

PERI - auto-tuning memory-intensive kernels for multicore

S Williams, K Datta, J Carter, L Oliker, J Shalf, K Yelick, D Bailey
2008 Journal of Physics: Conference Series  
Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.  ...  We present an auto-tuning approach to optimize application performance on emerging multicore architectures.  ...  This work was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by NSF contract CNS-0325873, and by Microsoft and Intel Funding under award #20080469.  ... 
doi:10.1088/1742-6596/125/1/012038 fatcat:a66kzdasovb6lf63swvetlsfxq

Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick
2009 Journal of Parallel and Distributed Computing  
Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.  ...  We present an auto-tuning approach to optimize application performance on emerging multicore architectures.  ...  Thus memory bandwidth is likely not an impediment to performance, allowing this Opteron to achieve nearly linear scaling for both the multicore and multi-socket experiments, as seen in Figure 11 (f).  ... 
doi:10.1016/j.jpdc.2009.04.002 fatcat:q26fu5e3tfezlbdpbond4bdglq

An Experimental Study on How to Build Efficient Multi-core Clusters for High Performance Computing

Luiz Carlos Pinto, Luiz H. B. Tomazella, M. A. R. Dantas
2008 2008 11th IEEE International Conference on Computational Science and Engineering  
From Figure 3, we can state that (1) bandwidth for either one-way or two-way communication on systems B and D is greater than for systems A and C at any message length.  ...  Moreover, (2) bandwidth behavior of two-way communication for system D and of one-way communication for system B is quite similar. (3) Two-way communication bandwidth for system B is similar compared  ...  A core is the atomic processing unit of a computing system. A socket contains one or more cores.  ... 
doi:10.1109/cse.2008.63 dblp:conf/cse/PintoTD08 fatcat:wz7gf5intvhspljwbsiprcul2a

Implementation of 3D FFTs Across Multiple GPUs in Shared Memory Environments

Nimalan Nandapalan, Jiri Jaros, Alistair P. Rendell, Bradley Treeby
2012 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies  
This clearly shows that direct GPU-to-GPU transfers are the key factor in obtaining good performance on multi-GPU systems.  ...  In this paper, a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA is presented.  ...  MULTI-GPU HARDWARE AND SOFTWARE The multi-GPU system used in this work is based on the Tyan barebone TYAN FT72B7015 [3] .  ... 
doi:10.1109/pdcat.2012.79 dblp:conf/pdcat/NandapalanJRT12 fatcat:4bphyjan2rgfxfxg5axpwv6rnu

Gromacs On Hybrid Cpu-Gpu And Cpu-Mic Clusters: Preliminary Porting Experiences, Results And Next Steps

Sadaf Alam
2014 Zenodo  
We present results that have been collected on the PRACE prototype systems as well as on other GPU and MIC accelerated platforms with similar configurations.  ...  This report introduces the hybrid implementation of the Gromacs application, and provides instructions on building and executing on PRACE prototype platforms with Graphical Processing Units (GPU) and Many  ...  The work was achieved using the PRACE Research Infrastructure resources at CSC, PSNC, CINECA and CSCS.  ... 
doi:10.5281/zenodo.822571 fatcat:g2vl3pizpnhrnmn6agtkz64lci

What GPU Computing Means for High-End Systems

Richard Vuduc, Kent Czechowski
2011 IEEE Micro  
At exascale, we estimate that a large 3D FFT will spend 1,000× more time on communication than on flops.  ...  Why balance matters: GPUs are a natural building block for an exascale system, given their high compute density (peak and bandwidth) and energy efficiency.  ... 
doi:10.1109/mm.2011.78 fatcat:g5a4gbr3gnaf5j64le3tnfk7lm

A Multicore Path to Connectomics-on-Demand

Alexander Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Odor, David Budden, Aleksandar Zlateski, Nir Shavit
2017 Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '17  
on a single commodity multicore machine.  ...  We present a high-throughput connectomics-on-demand system that runs on a multicore machine with fewer than 100 cores and extracts connectomes at the terabyte-per-hour pace of modern electron microscopes  ...  Jones, Hanspeter Pfister, David Cox, and Jeff Lichtman.  ... 
doi:10.1145/3018743.3018766 fatcat:riw5dgkdm5c27oppgzmzcoesfy

A Multicore Path to Connectomics-on-Demand

Nir Shavit
2016 Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures - SPAA '16  
on a single commodity multicore machine.  ...  We present a high-throughput connectomics-on-demand system that runs on a multicore machine with fewer than 100 cores and extracts connectomes at the terabyte-per-hour pace of modern electron microscopes  ...  Jones, Hanspeter Pfister, David Cox, and Jeff Lichtman.  ... 
doi:10.1145/2935764.2935825 dblp:conf/spaa/Shavit16 fatcat:tifxkx5cu5fotlqmfxmod24bh4

A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators

Rainer Buchty, Vincent Heuveline, Wolfgang Karl, Jan-Philipp Weiss
2011 Concurrency and Computation  
In this work we provide a survey on current multicore and accelerator technologies.  ...  In particular, we characterize the discrepancy to conventional parallel platforms with respect to hierarchical memory sub-systems, fine-grained parallelism on several system levels, and chip- and system-level  ...  Acknowledgements The Shared Research Group 16-1 received financial support by the Concept for the Future of Karlsruhe Institute of Technology in the framework of the German Excellence Initiative and the  ... 
doi:10.1002/cpe.1904 fatcat:fwg2vjaobral3b2v46vq4x2c3q