Automatic data and computation partitioning on scalable shared memory multiprocessors
[chapter]
1997
Lecture Notes in Computer Science
... determined by the NUMA-CAG, and dynamic partitions obtained by re-partitioning the array between the two loop nests. ...
Additionally, the suitability of data and/or computation partitions depends not only on data access patterns, but also on the characteristics of the target multiprocessor. ...
doi:10.1007/bfb0017282
fatcat:x4yiviptafayvnjcsl2lh3rvta
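To make the static-versus-dynamic partitioning idea above concrete, here is a minimal C/OpenMP sketch (array name, size, and loop bodies are illustrative, not from the paper): two loop nests over the same array prefer different partitions, so the partition is conceptually changed between them; on a NUMA machine this corresponds to re-mapping which processor owns which part of the array.

```c
/* Sketch: two loop nests that prefer different partitions of the same
 * array. All names and sizes are illustrative. */
#define N 1024
static double a[N][N];

void two_nests(void)
{
    /* Nest 1: rows are independent, so a row-block partition keeps
     * each processor's accesses local. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] += a[i][j - 1];      /* dependence along j only */

    /* Nest 2: columns are independent, so a column-block partition is
     * preferred; between the nests the array is conceptually
     * re-partitioned (on NUMA, pages may be migrated or re-mapped). */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)
            a[i][j] += a[i - 1][j];      /* dependence along i only */
}
```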
SPM-aware scheduling for nested loops in CMP systems
2013
ACM SIGBED Review
BACKGROUND AND RELATED WORK Most prior work on nested loops focuses either on loop partitioning, such as tiling, collapsing, and transformation, or on scheduling loops on different processors ...
2) how to schedule nested loops on the different SPMs of each processor in a CMP system? ...
doi:10.1145/2518148.2518151
fatcat:gc2uogmhs5epvcvv7xvxhsakfu
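Since scratchpad memories (SPMs) are software-managed, scheduling a loop nest onto them means explicitly staging tiles in and out. A minimal sketch follows, with memcpy standing in for the DMA transfers a real SPM would use; the tile size and buffer names are assumptions for illustration.

```c
/* Sketch of scratchpad (SPM) staging for one tile of a loop nest. */
#include <string.h>
#define N    4096
#define TILE 256                  /* chosen to fit the SPM capacity */

static float a[N];
static float spm_buf[TILE];       /* models the on-chip scratchpad  */

void spm_tiled_scale(float s)
{
    for (int t = 0; t < N; t += TILE) {
        memcpy(spm_buf, &a[t], TILE * sizeof(float)); /* DMA in  */
        for (int i = 0; i < TILE; i++)                /* compute */
            spm_buf[i] *= s;
        memcpy(&a[t], spm_buf, TILE * sizeof(float)); /* DMA out */
    }
}
```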
A data locality optimizing algorithm
2004
SIGPLAN notices
... optimizing locality across arbitrary loop nests based on affine partitioning [6]. ...
Our paper "A Data Locality Optimizing Algorithm" presented an automatic blocking algorithm for perfect loop nests on uniprocessors and multiprocessors [8] . ...
doi:10.1145/989393.989437
fatcat:eajlltosg5gjbhqtdituqhh3hi
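The blocking the paper derives automatically looks, for matrix multiply, like the hand-tiled loop nest below. This is the textbook form of the transformation, not the paper's algorithm itself; the tile size B is an illustrative constant that would be tuned to cache capacity in practice.

```c
/* Loop blocking (tiling) of matrix multiply for cache locality. */
#define N 512
#define B 64                       /* illustrative tile size */
static double A[N][N], C[N][N], D[N][N];

void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
      for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
          /* full tiles only, since B divides N here */
          for (int i = ii; i < ii + B; i++)
            for (int k = kk; k < kk + B; k++)
              for (int j = jj; j < jj + B; j++)
                C[i][j] += A[i][k] * D[k][j];
}
```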
A data locality optimizing algorithm
1991
SIGPLAN notices
... optimizing locality across arbitrary loop nests based on affine partitioning [6]. ...
Our paper "A Data Locality Optimizing Algorithm" presented an automatic blocking algorithm for perfect loop nests on uniprocessors and multiprocessors [8] . ...
doi:10.1145/113446.113449
fatcat:2ovh6wjxmjhorne6ox3dlj47ha
Enhancing the performance of autoscheduling in Distributed Shared Memory multiprocessors
[chapter]
1998
Lecture Notes in Computer Science
... scheduling of parallel tasks, and dynamic program adaptability on multiprogrammed shared memory multiprocessors. ...
Our technique partitions the application Hierarchical Task Graph and maps the derived partitions to clusters of processors in the DSM architecture. ...
... all our partners in the NANOS project and George Tsolis for his help in improving the appearance of the paper. ...
doi:10.1007/bfb0057892
fatcat:sdudtgqxeje3jhwt56fxhzipwq
A compiler framework for optimization of affine loop nests for GPGPUs
2008
Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach ...
... factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. ...
... National Science Foundation through awards 0121676, 0121706, 0403342, 0508245, 0509442, 0509467 and 0541409. ...
doi:10.1145/1375527.1375562
dblp:conf/ics/BaskaranBKRRS08
fatcat:x6rdnmlkvzaw7jfcet3pxzsewi
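The "padding factors for conflict-minimal data access" mentioned in the snippet can be illustrated in plain C: on a GPU, 32 threads reading one column of a 32-wide shared-memory tile all hit the same bank, and padding the row stride by one element spreads the column across banks. A minimal sketch, with illustrative names:

```c
/* Array padding for conflict-free strided (column) access. */
#define TILE 32
static float tile_padded[TILE][TILE + 1];  /* +1 column of padding */

float column_sum(int col)
{
    float s = 0.0f;
    for (int row = 0; row < TILE; row++)
        s += tile_padded[row][col];  /* stride TILE+1, not TILE */
    return s;
}
```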
Fusion of loops for parallelism and locality
1997
IEEE Transactions on Parallel and Distributed Systems
The techniques are evaluated on a 56-processor KSR2 multiprocessor and on a 16-processor Convex SPP-1000 multiprocessor. ...
In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with ...
The use of the KSR2 and Convex SPP-1000 multiprocessors was provided by the University of Michigan Center for Parallel Computing. ...
doi:10.1109/71.577265
fatcat:rticunkxsbgpvjoa4hbczr6tei
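A fusion-preventing dependence arises when a later loop reads a value an earlier loop has not yet produced at the corresponding iteration, so naive fusion would reorder the read before the write. The standard remedy, shifting one loop before fusing, is sketched below; the example is illustrative, not taken from the paper.

```c
/* Original pair of loops:
 *   for (i = 0; i < N; i++)     a[i] = f(i);
 *   for (i = 0; i < N-1; i++)   b[i] = a[i+1] * 0.5;
 * Naive fusion reads a[i+1] before it is written. Shifting the second
 * statement by one iteration makes fusion legal; peeling handles the
 * boundary. */
#define N 1000
static double a[N], b[N];
static double f(int i) { return (double)i; }

void fused(void)
{
    a[0] = f(0);                     /* peeled first iteration   */
    for (int i = 1; i < N; i++) {
        a[i] = f(i);                 /* from loop 1              */
        b[i - 1] = a[i] * 0.5;       /* loop 2, shifted by one   */
    }
}
```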
Data layout optimization for GPGPU architectures
2013
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '13
Specifically, we try to optimize the layout of the arrays accessed in affine loop nests, for both the device memory and shared memory, at both coarse grain and fine grain parallelization levels. ...
Our approach employs a widely applicable strategy based on a novel concept called data localization. ...
Acknowledgments This research is supported in part by NSF grants #1213052, #1152479, #1147388, #1139023, #1017882, #0963839, #0811687 and a grant from Microsoft Corporation. ...
doi:10.1145/2442516.2442546
dblp:conf/ppopp/LiuDJK13
fatcat:ldaeu4642zh4lawlctxwbghn7q
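One layout transformation of the kind such a framework automates is array-of-structs versus struct-of-arrays: when a loop touches only one field, the SoA layout makes accesses unit-stride (and, across GPU threads, coalesced). A minimal sketch with illustrative names:

```c
/* AoS vs SoA layout for the same data. */
#define N 100000

struct particle_aos { float x, y, z, m; };
static struct particle_aos p_aos[N];
static struct { float x[N], y[N], z[N], m[N]; } p_soa;

void scale_x_aos(float s)        /* touches every 4th float */
{
    for (int i = 0; i < N; i++) p_aos[i].x *= s;
}

void scale_x_soa(float s)        /* unit-stride, contiguous  */
{
    for (int i = 0; i < N; i++) p_soa.x[i] *= s;
}
```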
On-chip memory space partitioning for chip multiprocessors using polyhedral algebra
2010
IET Computers & Digital Techniques
We evaluate the resulting memory configurations using a set of benchmarks and compare them to pure private and pure shared memory on-chip multiprocessor architectures. ...
One of the most important issues in designing a chip multiprocessor is to decide its on-chip memory organisation. ...
Acknowledgments This research is supported in part by NSF grants CNS #0720645, CCF #0811687, CCF #0702519, CNS #0202007, CNS #0509251, by a grant from Microsoft Corporation, by a grant from IBM, and by ...
doi:10.1049/iet-cdt.2009.0089
fatcat:n4apvjbzsbd4hfex4x7cjer72e
Exploiting wavefront parallelism on large-scale shared-memory multiprocessors
2001
IEEE Transactions on Parallel and Distributed Systems
Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. ...
In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. ...
ACKNOWLEDGMENTS This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Information Technology Research Centre of Ontario. ...
doi:10.1109/71.914756
fatcat:2ucmp4mx2vcuxmbyt4mx7ftsey
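In a wavefront schedule, all iterations on the same hyperplane i + j = w of the iteration space are independent and can run in parallel, while successive hyperplanes execute in order. A minimal C/OpenMP sketch of this structure for a 2D recurrence (bounds and stencil are illustrative):

```c
/* Wavefront execution over anti-diagonals of a 2D recurrence. */
#define N 1024
static double g[N][N];

void wavefront(void)
{
    for (int w = 2; w <= 2 * (N - 1); w++) {   /* one hyperplane */
        int ilo = (w - (N - 1) > 1) ? w - (N - 1) : 1;
        int ihi = (w - 1 < N - 1) ? w - 1 : N - 1;
        #pragma omp parallel for               /* points on it   */
        for (int i = ilo; i <= ihi; i++) {
            int j = w - i;
            g[i][j] = 0.5 * (g[i - 1][j] + g[i][j - 1]);
        }
    }
}
```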
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories
2008
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming - PPoPP '08
We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on ...
Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. ...
... National Science Foundation through awards 0121676, 0121706, 0403342, 0508245, 0509442, 0509467 and 0541409. ...
doi:10.1145/1345206.1345210
dblp:conf/ppopp/BaskaranBKRRS08
fatcat:norktuzqincvjpf3qaa4hlklu4
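The multi-level tiling mentioned above nests one tiling level per level of the memory hierarchy: an outer tile sized for off-chip transfers and an inner tile sized for the fast explicitly managed on-chip store. A minimal sketch with illustrative tile sizes:

```c
/* Two-level tiling for a two-level explicitly managed hierarchy. */
#define N  (1 << 20)
#define T1 4096      /* off-chip <-> on-chip transfer granularity */
#define T2 256       /* working set per innermost compute step    */
static float x[N];

void two_level(void)
{
    for (int t1 = 0; t1 < N; t1 += T1)             /* level-1 tiles */
        for (int t2 = t1; t2 < t1 + T1; t2 += T2)  /* level-2 tiles */
            for (int i = t2; i < t2 + T2; i++)
                x[i] = x[i] * 2.0f + 1.0f;
}
```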
Design of parallel algorithms for a distributed memory hypercube
1992
Microprocessors and microsystems
On the other hand, the structured model, or SPMD model (single-program/multiple-data), attempts to mimic the simplicity of programming synchronous parallel systems. ...
In the unstructured model, or MPMD (multiple-program/multiple-data), each processor will have its own local data and its own local program that will process this data. ...
ACKNOWLEDGEMENTS This work has been partially supported by grants TIC88-0094, MIC88-0549, MIC90-1264-E and TIC90-0407 of the CICYT and XUGA20604A90 of the Xunta de Galicia. ...
doi:10.1016/0141-9331(92)90107-5
fatcat:5ikfflucsbdr3dk5tlq5lnvx6y
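The SPMD/MPMD contrast in the snippets can be shown in a few lines: under SPMD every processor runs the same program and uses its rank to select its slice of the data, whereas under MPMD each processor would run its own program. A minimal sketch using standard MPI calls (the data size and partitioning are illustrative):

```c
/* Minimal SPMD sketch: one program, rank-selected work. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 1 << 20;
    int chunk = n / size;          /* this rank's share of the data */
    long first = (long)rank * chunk;
    printf("rank %d of %d handles [%ld, %ld)\n",
           rank, size, first, first + chunk);

    MPI_Finalize();
    return 0;
}
```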
OpenMP to GPGPU
2009
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '09
Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both ...
The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. ...
Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0429535-CCF, CNS-0751153, and 0833115-CCF. ...
doi:10.1145/1504176.1504194
dblp:conf/ppopp/LeeME09
fatcat:7ru27sozu5h5hhlni4w4cdx6hi
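The kind of OpenMP input such a translator consumes is an ordinary parallel loop nest like the Jacobi-style sweep below; a GPU translation conceptually maps each parallel iteration (i, j) to one GPU thread and places the arrays in device memory. This sketch shows only the input side, not the translator's actual output, and the names are illustrative.

```c
/* OpenMP Jacobi-style sweep: each (i, j) iteration is independent. */
#define N 1024
static float a[N][N], b[N][N];

void jacobi_step(void)
{
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            b[i][j] = 0.25f * (a[i - 1][j] + a[i + 1][j] +
                               a[i][j - 1] + a[i][j + 1]);
}
```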
OpenMP to GPGPU
2009
SIGPLAN notices
Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both ...
The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. ...
Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0429535-CCF, CNS-0751153, and 0833115-CCF. ...
doi:10.1145/1594835.1504194
fatcat:wbpl7ohbzffedndc6s6tafkfny
Efficient runtime thread management for the nano-threads programming model
[chapter]
1998
Lecture Notes in Computer Science
The proposed mechanisms attempt to obtain maximum benefits from data locality on cache-coherent NUMA multiprocessors. ...
The nano-threads programming model was proposed to effectively integrate multiprogramming on shared-memory multiprocessors, with the exploitation of fine-grain parallelism from standard applications. ...
... and the referees for their helpful comments. ...
doi:10.1007/3-540-64359-1_688
fatcat:54eahhbswfb2pe37piqtibzpoi