Characterization of Scientific Workloads on Systems with Multi-Core Processors
2006
2006 IEEE International Symposium on Workload Characterization
In addition, we evaluated a number of processor affinity techniques for managing memory placement on these multi-core systems. ...
Multi-core processors are planned for virtually all next-generation HPC systems. ...
Simply put, the shared memory and I/O (network) bandwidth of multiple cores in a socket draws into question both how efficiently an application can use multiple cores and what methods provide the highest ...
doi:10.1109/iiswc.2006.302747
dblp:conf/iiswc/AlamBKRV06
fatcat:fd4kwtxn25fqtpd4rll2cyv4sq
Lattice Boltzmann simulation optimization on leading multicore platforms
2008
Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)
Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications. ...
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. ...
We would also like to thank George Vahala and his research group for the original version of the LBMHD code. ...
doi:10.1109/ipdps.2008.4536295
dblp:conf/ipps/WilliamsCOSY08
fatcat:akmsmuhfjrcv3l2fku7endvt64
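The auto-tuning approach mentioned in these abstracts boils down to generating candidate code variants (blockings, unrolling, vectorization, prefetch settings), timing each on the target machine, and keeping the fastest. A minimal sketch of that search loop, using a toy blocked array-update kernel and an illustrative candidate list (not the LBMHD code or its actual search space):

```python
# Minimal auto-tuning sketch: time several candidate block sizes for a
# toy blocked array update and keep the fastest. The kernel and the
# candidate list are illustrative only (no warm-up or repetition for brevity).
import time
import numpy as np

def blocked_update(a, block):
    """Toy blocked kernel: scale each block of the array in place."""
    n = a.shape[0]
    for i in range(0, n, block):
        for j in range(0, n, block):
            a[i:i + block, j:j + block] *= 1.0000001

def autotune(n=2048, candidates=(32, 64, 128, 256, 512)):
    a = np.random.rand(n, n)
    best = None
    for block in candidates:
        t0 = time.perf_counter()
        blocked_update(a, block)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[1]:
            best = (block, elapsed)
    return best

if __name__ == "__main__":
    block, elapsed = autotune()
    print(f"best block size: {block} ({elapsed:.4f} s)")
```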
Optimized FFT computations on heterogeneous platforms with application to the Poisson equation
2014
Journal of Parallel and Distributed Computing
Highlights: • New strategy to decompose large multi-dimensional FFTs on CPU-GPU platforms. • Executions of GPU kernels are almost completely overlapped with PCI bus transfer. • Multi-dimensional data is transferred only once between the GPU and CPU. • Scheme is equally effective for the single and double precision computations.
Abstract: We develop optimized multi-dimensional FFT implementations ...
Acknowledgments This work was partially supported by an NSF PetaApps award, grant OCI0904920, the NVIDIA Research Excellence Center at the University of Maryland, and by an NSF Research Infrastructure ...
doi:10.1016/j.jpdc.2014.03.009
fatcat:efito37ujzdhzmrteulccqbisa
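The decomposition in the highlights relies on the separability of the multi-dimensional DFT: a 2D or 3D FFT factors into independent batches of 1D FFTs along each axis, which is what allows one batch to execute on the GPU while the next is still crossing the PCI bus. A minimal host-only sketch of the row-column factorization itself (the CPU-GPU overlap is not shown):

```python
# Row-column decomposition of a 2D FFT: 1D FFTs over rows, then over
# columns, reproduce the full 2D transform. This separability is what
# lets batches of 1D FFTs be scheduled independently (e.g. on a GPU).
import numpy as np

def fft2_row_column(x):
    rows = np.fft.fft(x, axis=1)     # 1D FFT of every row
    return np.fft.fft(rows, axis=0)  # 1D FFT of every column

x = np.random.rand(256, 256) + 1j * np.random.rand(256, 256)
assert np.allclose(fft2_row_column(x), np.fft.fft2(x))
print("row-column decomposition matches np.fft.fft2")
```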
Optimizing UPC Programs for Multi-Core Systems
2010
Scientific Programming
Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases. ...
The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems ...
Though UPC and other PGAS languages were initially focused on large scale distributed-memory machines, they are also a good fit for emerging multicore systems because the data partitioning capability of ...
doi:10.1155/2010/646829
fatcat:q63ngpj47jblhfzbfcdehsmuyi
Roofline: An Insightful Visual Performance Model for Multicore Architectures
2009
Communications of the ACM
We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations. ...
Our thanks go to Joseph Gebis, Leonid Oliker, John Shalf, Katherine Yelick, and the rest of the Par Lab for feedback on the Roofline model, and to Jike Chong, Kaushik Datta, Mark Hoemmen, Matt Johnson, Jae Lee, Rajesh Nishtala, Heidi Pan, David Wessel, Mark Hill and the anonymous reviewers for feedback on early drafts of this paper. ...
doi:10.1145/1498765.1498785
fatcat:t4bx3edd5ba5hbfhd2xrpuo2si
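The Roofline bound itself is a one-line formula: attainable throughput is the minimum of the machine's peak floating-point rate and its peak memory bandwidth times the kernel's arithmetic intensity (flops per byte of DRAM traffic). A small sketch with placeholder machine peaks (the numbers below are illustrative, not measurements from the paper):

```python
# Roofline bound: attainable GFLOP/s = min(peak_gflops, peak_bw * intensity).
# Machine peaks below are hypothetical placeholders.
def roofline(intensity_flops_per_byte, peak_gflops=100.0, peak_bw_gbs=50.0):
    return min(peak_gflops, peak_bw_gbs * intensity_flops_per_byte)

for ai in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"AI = {ai:4.2f} flop/byte -> {roofline(ai):6.1f} GFLOP/s")
```

The ridge point, peak_gflops / peak_bw_gbs flops per byte, separates memory-bound kernels (to its left) from compute-bound ones (to its right).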
Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
Victoria Falls (dual-sockets on all systems). ...
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. ...
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, DARPA, or Intel. ...
doi:10.1109/ipdps.2010.5470415
dblp:conf/ipps/ChandramowlishwaranWOLBV10
fatcat:p7tw54f5fza5ddjtyuwz7pn4gi
PERI - auto-tuning memory-intensive kernels for multicore
2008
Journal of Physics, Conference Series
Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications. ...
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. ...
This work was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by NSF contract CNS-0325873, and by Microsoft and Intel Funding under award #20080469. ...
doi:10.1088/1742-6596/125/1/012038
fatcat:a66kzdasovb6lf63swvetlsfxq
Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms
2009
Journal of Parallel and Distributed Computing
Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications. ...
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. ...
Thus memory bandwidth is likely not an impediment to performance, allowing this Opteron to achieve nearly linear scaling for both the multicore and multi-socket experiments, as seen in Figure 11 (f). ...
doi:10.1016/j.jpdc.2009.04.002
fatcat:q26fu5e3tfezlbdpbond4bdglq
An Experimental Study on How to Build Efficient Multi-core Clusters for High Performance Computing
2008
2008 11th IEEE International Conference on Computational Science and Engineering
From Figure 3, we can state that (1) bandwidth for either one-way or two-way communication on systems B and D is greater than for systems A and C for any message length. ...
Moreover, (2) bandwidth behavior of two-way communication for system D and of one-way communication for system B are quite similar. (3) Two-way communication bandwidth for system B is similar compared ...
A core is the atomic processing unit of a computing system. A socket contains one or more cores. ...
doi:10.1109/cse.2008.63
dblp:conf/cse/PintoTD08
fatcat:wz7gf5intvhspljwbsiprcul2a
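The one-way versus two-way bandwidth comparisons cited from Figure 3 are the kind of numbers a point-to-point ping-pong microbenchmark produces. A minimal sketch using mpi4py (assumed available; the message size and repetition count are arbitrary, and this is not the authors' benchmark):

```python
# Minimal MPI ping-pong bandwidth sketch (run with: mpiexec -n 2 python pingpong.py).
# Message size and repetition count are arbitrary illustrative choices.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 1 << 20                        # 1 MiB message
buf = np.zeros(nbytes, dtype=np.uint8)
reps = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Each repetition moves the buffer once in each direction.
    bw_mbs = 2 * reps * nbytes / elapsed / 1e6
    print(f"ping-pong bandwidth: {bw_mbs:.1f} MB/s for {nbytes}-byte messages")
```

Binding the two ranks to the same socket versus different sockets (for example via the MPI launcher's process-binding options) reproduces the intra- versus inter-socket contrast such studies examine.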
Implementation of 3D FFTs Across Multiple GPUs in Shared Memory Environments
2012
2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies
This clearly shows that direct GPU-to-GPU transfers are the key factor in obtaining good performance on multi-GPU systems. ...
In this paper, a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA is presented. ...
Multi-GPU hardware and software: The multi-GPU system used in this work is based on the Tyan barebone TYAN FT72B7015 [3]. ...
doi:10.1109/pdcat.2012.79
dblp:conf/pdcat/NandapalanJRT12
fatcat:4bphyjan2rgfxfxg5axpwv6rnu
GROMACS on Hybrid CPU-GPU and CPU-MIC Clusters: Preliminary Porting Experiences, Results and Next Steps
2014
Zenodo
We present results that have been collected on the PRACE prototype systems as well as on other GPU and MIC accelerated platforms with similar configurations. ...
This report introduces hybrid implementation of the Gromacs application, and provides instructions on building and executing on PRACE prototype platforms with Graphical Processing Units (GPU) and Many ...
The work was achieved using the PRACE Research Infrastructure resources at CSC, PSNC, CINECA and CSCS. ...
doi:10.5281/zenodo.822571
fatcat:g2vl3pizpnhrnmn6agtkz64lci
What GPU Computing Means for High-End Systems
2011
IEEE Micro
At exascale, we estimate that a large 3D FFT will spend 1,000× more time on communication than on flops. ...
Why balance matters: GPUs are a natural building block for an exascale system, given their high compute density (peak and bandwidth) and energy efficiency. ...
doi:10.1109/mm.2011.78
fatcat:g5a4gbr3gnaf5j64le3tnfk7lm
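The 1,000× figure presumably comes from weighing the all-to-all traffic of the FFT's distributed transposes against its O(N³ log N) flop count under projected exascale compute and network rates. A hedged back-of-the-envelope of that kind of comparison, with deliberately hypothetical machine parameters (it illustrates the shape of the argument, not the article's numbers):

```python
# Back-of-the-envelope time estimate for a distributed 3D FFT:
# compute time from the 5*N^3*log2(N^3) flop count, communication time
# from moving the full N^3 grid through the network bisection during the
# transposes. All machine parameters are hypothetical placeholders.
import math

def fft3d_estimate(n, peak_flops, bisection_bw, bytes_per_elem=16,
                   transposes=2):
    flops = 5.0 * n**3 * math.log2(n**3)
    t_compute = flops / peak_flops                              # seconds
    t_comm = transposes * n**3 * bytes_per_elem / bisection_bw  # seconds
    return t_compute, t_comm

t_c, t_m = fft3d_estimate(n=16384, peak_flops=1e18, bisection_bw=1e15)
print(f"compute ~{t_c:.3e} s, communication ~{t_m:.3e} s, "
      f"ratio ~{t_m / t_c:.0f}x")
```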
A Multicore Path to Connectomics-on-Demand
2017
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '17
on a single commodity multicore machine. ...
We present a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes ...
Jones, Hanspeter Pfister, David Cox, and Jeff Lichtman. ...
doi:10.1145/3018743.3018766
fatcat:riw5dgkdm5c27oppgzmzcoesfy
A Multicore Path to Connectomics-on-Demand
2016
Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures - SPAA '16
on a single commodity multicore machine. ...
We present a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes ...
Jones, Hanspeter Pfister, David Cox, and Jeff Lichtman. ...
doi:10.1145/2935764.2935825
dblp:conf/spaa/Shavit16
fatcat:tifxkx5cu5fotlqmfxmod24bh4
A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators
2011
Concurrency and Computation
In this work we provide a survey on current multicore and accelerator technologies. ...
In particular, we characterize the discrepancy to conventional parallel platforms with respect to hierarchical memory sub-systems, fine-grained parallelism on several system levels, and chip-and system-level ...
Acknowledgements The Shared Research Group 16-1 received financial support by the Concept for the Future of Karlsruhe Institute of Technology in the framework of the German Excellence Initiative and the ...
doi:10.1002/cpe.1904
fatcat:fwg2vjaobral3b2v46vq4x2c3q