Automatic compiler techniques for thread coarsening for multithreaded architectures
2000
Proceedings of the 14th international conference on Supercomputing - ICS '00
Thread partitioning is the most important task in compiling high-level languages for multithreaded architectures. ...
Our experiments were performed using the EARTH-C compiler and the EARTH multithreaded architecture model emulated on both a cluster of Pentium PCs and a distributed memory multiprocessor. ...
Acknowledgments We would like to thank Laurie Hendren and the ACAPS group at McGill University for providing us with a copy of the EARTH-C compiler. ...
doi:10.1145/335231.335261
dblp:conf/ics/ZoppettiAPATG00
fatcat:k3ny2hxqtbdchjczrnxtynivbu
Parallelization of a dynamic unstructured algorithm using three leading programming paradigms
2000
IEEE Transactions on Parallel and Distributed Systems
... version on the newly-released Tera Multithreaded Architecture (MTA). ...
Our overall results demonstrate that multithreaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers. ...
This multithreaded architecture is especially well-suited for irregular and dynamic problems. ...
doi:10.1109/71.879776
fatcat:6gjamrhcmrb7biabyr2yguzwrq
Parallelization of a dynamic unstructured application using three leading paradigms
1999
Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99
... version on the newly-released Tera Multithreaded Architecture (MTA). ...
Our overall results demonstrate that multithreaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers. ...
This multithreaded architecture is especially well-suited for irregular and dynamic problems. ...
doi:10.1145/331532.331571
dblp:conf/sc/OlikerB99
fatcat:3fbdskwh3bb3dnpwm73zag4xlm
Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality
[chapter]
2012
Lecture Notes in Computer Science
More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level for achieving higher performance. ...
Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. ...
Acknowledgement We would like to thank the reviewers for helping us improve the quality of the final version of this paper. We also thank Dr. Martin Burtscher for allowing us compute time on his GPUs. ...
doi:10.1007/978-3-642-28652-0_2
fatcat:nm4nhqvkajf3zg5dqwooe7b2ky
Towards a first vertical prototyping of an extremely fine-grained parallel programming approach
2001
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '01
Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. ...
The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. ...
However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads. ...
doi:10.1145/378580.378597
dblp:conf/spaa/NaishlosNTV01
fatcat:6r7qxrrqtvdvzfwtqebfyrhe4u
Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach
2003
Theory of Computing Systems
Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. ...
The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. ...
However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads. ...
doi:10.1007/s00224-003-1086-6
fatcat:ptjyw4sdrjhj3fhtqicxkuowpy
Evaluating the XMT Parallel Programming Model
[chapter]
2001
Lecture Notes in Computer Science
Despite low thread overhead, thread coarsening is still necessary to some extent, but can usually be automatically applied by the XMT compiler. ...
Explicit-multithreading (XMT) is a parallel programming model designed for exploiting on-chip parallelism. ...
The XMT compiler detects such cases, and automatically transforms them such that fewer but longer threads are used. ...
doi:10.1007/3-540-45401-2_8
fatcat:lzwe6a3bdfgjjmrj2arn2uakme
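The XMT snippets above describe the compiler clustering very fine-grained threads into fewer but longer ones. As a rough illustration of that general transformation only, and not of the XMT compiler's actual algorithm, the C sketch below replaces one logical thread per work item with a small pool of coarsened threads; work_item, N, and NTHREADS are hypothetical names, not taken from the papers.

/* Hedged sketch of thread coarsening: instead of one fine-grained thread per
 * work item, a fixed pool of threads each executes a contiguous block of
 * items.  Names (work_item, N, NTHREADS) are illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define N        1024   /* number of fine-grained work items */
#define NTHREADS 4      /* coarsened thread count            */

static double data[N];

static void work_item(int i)             /* one fine-grained unit of work */
{
    data[i] = (double)i * 0.5;
}

struct range { int lo, hi; };

static void *coarsened_thread(void *arg) /* fewer, longer-running threads */
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        work_item(i);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * chunk;
        r[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, coarsened_thread, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}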
Optimization and architecture effects on GPU computing workload performance
2012
2012 Innovative Parallel Computing (InPar)
Several design principles of GPU architectures have been and will likely continue to be very consistent, such as SIMT and high degrees of multithreading. ...
We have surveyed many GPU computing applications and kernels and distilled what we believe to be several key optimization techniques and design considerations for high-performance GPU-computing workloads ...
This is why privatization is an extremely powerful technique for today's CMPs, with a relatively small number of threads, but somewhat limited for the levels of thread parallelism in highly multithreaded ...
doi:10.1109/inpar.2012.6339605
fatcat:z7ujhcbv4rdwhfd2hchcw766fy
Eliminating synchronization bottlenecks using adaptive replication
2003
ACM Transactions on Programming Languages and Systems
This article presents a new technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. ...
We have implemented adaptive replication in the context of a parallelizing compiler for a subset of C++. ...
ACKNOWLEDGMENTS We would like to thank the anonymous referees of various versions of this article for their thoughtful and helpful comments. ...
doi:10.1145/641909.641911
fatcat:6ftcwn2lbbc3vhv2qb7spqujfm
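The abstract above describes replicating objects so that threads no longer serialize on every atomic update. The C sketch below is a minimal, hand-written illustration of that replication idea for a simple reduction-style accumulator; the replica array and worker function are assumptions for the example, whereas the cited work applies (and adapts) the transformation automatically in a compiler.

/* Hedged sketch of replication to remove a synchronization bottleneck: each
 * thread updates a private replica of a shared accumulator, and the replicas
 * are combined once at the end instead of locking on every update. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITEMS    100000

static double replica[NTHREADS];   /* one private copy per thread */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = id; i < ITEMS; i += NTHREADS)
        replica[id] += (double)i;  /* no lock needed on a private replica */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }

    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        sum += replica[t];         /* combine replicas once at the end */
    }
    printf("sum = %.0f\n", sum);
    return 0;
}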
Analysis of Task Offloading for Accelerators
[chapter]
2010
Lecture Notes in Computer Science
Overall, our performance is better compared to the results obtained from the IBM compiler for the Cell processor. ...
... for programmers to offload parts of their applications to the auxiliary processors. ...
Acknowledgements We would like to thank the Barcelona Supercomputing Center (BSC) for the use of their machines. ...
doi:10.1007/978-3-642-11515-8_24
fatcat:d3kcmhhhznakfeivzung7346r4
Mapping and optimization of the AVS video decoder on a high performance chip multiprocessor
2010
2010 IEEE International Conference on Multimedia and Expo
The input dependent variability of execution time per work chunk is addressed using dynamic scheduling to allocate work to each thread. ...
This paper presents the implementation, optimization and characterization of the AVS video decoder on Intel Core i7, a quad-core, hyper-threaded, chip multiprocessor (CMP). ...
... threads), multithreaded version (4 threads) and final, vectorized, multithreaded code (4 threads). ...
doi:10.1109/icme.2010.5582558
dblp:conf/icmcs/KrommydasTAB10
fatcat:36fjeahcfvayngwne4id6t5orm
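The first snippet above mentions dynamic scheduling to absorb input-dependent variability in per-chunk execution time. The C sketch below illustrates that idea under the assumption of a shared atomic chunk counter; decode_chunk, NCHUNKS, and the cost model are hypothetical placeholders, not details of the AVS decoder implementation.

/* Hedged sketch of dynamic scheduling of variable-cost work chunks: worker
 * threads repeatedly claim the next chunk index from a shared atomic counter,
 * so an expensive chunk does not leave the other threads idle. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NCHUNKS  64

static atomic_int next_chunk = 0;
static double chunk_cost[NCHUNKS];     /* stand-in for per-chunk results */

static void decode_chunk(int c)        /* placeholder for the real work  */
{
    double x = 0.0;
    for (int i = 0; i < 1000 * (c % 7 + 1); i++)   /* input-dependent cost */
        x += i * 1e-6;
    chunk_cost[c] = x;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int c = atomic_fetch_add(&next_chunk, 1);  /* claim the next chunk */
        if (c >= NCHUNKS)
            break;
        decode_chunk(c);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("last chunk cost: %f\n", chunk_cost[NCHUNKS - 1]);
    return 0;
}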
Scheduling threads for constructive cache sharing on CMPs
2007
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07
Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. ...
In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive ...
Using Profiling Information for Automatic Task Coarsening: The automatic task coarsening algorithm traverses the task group tree from top to bottom and evaluates a heuristic stop criterion at every node ...
doi:10.1145/1248377.1248396
dblp:conf/spaa/ChenGKLABFFHMW07
fatcat:7zuvfmkmorbzzdwlmkdl5pmwa4
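The last snippet above describes a task coarsening pass that walks the task group tree from top to bottom and evaluates a heuristic stop criterion at each node. The C sketch below shows the shape of such a traversal, assuming a cache-footprint stop criterion; the task_group layout, the SHARED_CACHE_BYTES threshold, and run_as_single_task are illustrative stand-ins, not the paper's actual heuristic.

/* Hedged sketch of top-down task coarsening over a task-group tree: at each
 * node a heuristic stop criterion decides whether the whole subtree becomes
 * one coarse task or the traversal descends into the children. */
#include <stdio.h>
#include <stdlib.h>

struct task_group {
    size_t working_set_bytes;        /* estimated footprint of the subtree */
    int    nchildren;
    struct task_group **children;
};

#define SHARED_CACHE_BYTES (4u << 20)   /* assumed shared-cache budget */

static void run_as_single_task(struct task_group *g)
{
    printf("coarse task: %zu bytes\n", g->working_set_bytes);
}

/* Traverse top to bottom; coarsen once the subtree fits the cache budget. */
static void coarsen(struct task_group *g)
{
    if (g->working_set_bytes <= SHARED_CACHE_BYTES || g->nchildren == 0) {
        run_as_single_task(g);       /* heuristic stop criterion satisfied */
        return;
    }
    for (int i = 0; i < g->nchildren; i++)
        coarsen(g->children[i]);     /* otherwise keep splitting */
}

int main(void)
{
    struct task_group leaf1 = { 2u << 20, 0, NULL };
    struct task_group leaf2 = { 3u << 20, 0, NULL };
    struct task_group *kids[2] = { &leaf1, &leaf2 };
    struct task_group root  = { 5u << 20, 2, kids };

    coarsen(&root);
    return 0;
}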
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations
2002
SIAM Review
Thanks also to Bruce Hendrickson and the other anonymous referees for their suggestions that helped improve the paper. ...
... promising strategy for hybrid architectures. ...
...ity for PCG due to a lack of thread-level parallelism. ...
doi:10.1137/s00361445003820
fatcat:o7uwxsbcfnf4ppcfc2sbpxncmm
Dynamic tiling for effective use of shared caches on multithreaded processors
2004
International Journal of High Performance Computing and Networking
The key idea is to use two tile sizes in the program, one for single-threaded execution mode and one suitable for multithreaded execution mode and switch between tile sizes at runtime. ...
Simultaneous multithreaded (SMT) processors use data caches which are dynamically shared between threads. ...
Acknowledgement The authors would like to thank the IJHPCN referees for several helpful suggestions. A preliminary version of this work has been published (Nikolopoulos, 2003). ...
doi:10.1504/ijhpcn.2004.009265
fatcat:as5smmoulnaavnchbcokabjbmu
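The abstract above explains dynamic tiling as keeping two tile sizes in the program, one per execution mode, and switching between them at runtime. The C sketch below is a minimal illustration under that assumption; the tile sizes and the cache_is_shared flag are made up for the example and are not values from the cited paper.

/* Hedged sketch of dynamic tiling: the tiled loop picks between two tile
 * sizes at runtime depending on whether the cache is currently shared with
 * another thread. */
#include <stdio.h>

#define N           512
#define TILE_SINGLE 128   /* tile for single-threaded mode         */
#define TILE_SHARED  64   /* smaller tile when the cache is shared */

static double a[N], b[N];

static void tiled_scale(int cache_is_shared)
{
    int tile = cache_is_shared ? TILE_SHARED : TILE_SINGLE;

    for (int t = 0; t < N; t += tile)            /* iterate over tiles */
        for (int i = t; i < t + tile && i < N; i++)
            b[i] = 2.0 * a[i];                   /* work on one tile   */
}

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    tiled_scale(0);   /* single-threaded mode: large tiles  */
    tiled_scale(1);   /* multithreaded mode: smaller tiles  */

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}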
Compiler generation and autotuning of communication-avoiding operators for geometric multigrid
2013
20th Annual International Conference on High Performance Computing
To make the approach portable, an underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically identify the best implementation for a particular architecture and at each computation phase. ...
As modern architectures continue to grow in core count and exhibit a hierarchy of complex inter-thread and inter-process interactions, new communication-avoiding techniques have been introduced for GMG ...
doi:10.1109/hipc.2013.6799131
dblp:conf/hipc/BasuVHWSO13
fatcat:e7jdczjskbdh7buasco2fhrm7m
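The abstract above describes an autotuner that explores alternative implementations and selects the best one per architecture and computation phase. The C sketch below shows only the skeleton of such a selection loop, with two made-up variants standing in for the real communication-avoiding and threading choices explored by the cited system.

/* Hedged sketch of the autotuning idea: time each candidate implementation
 * of a computation phase and keep the fastest one for this machine. */
#include <stdio.h>
#include <time.h>

#define N 1000000
static double x[N], y[N];

static void variant_a(void)           /* candidate 1: straightforward loop */
{
    for (int i = 0; i < N; i++)
        y[i] = 0.5 * x[i] + 1.0;
}

static void variant_b(void)           /* candidate 2: manually unrolled     */
{
    for (int i = 0; i + 1 < N; i += 2) {
        y[i]     = 0.5 * x[i]     + 1.0;
        y[i + 1] = 0.5 * x[i + 1] + 1.0;
    }
    if (N % 2)
        y[N - 1] = 0.5 * x[N - 1] + 1.0;
}

int main(void)
{
    void (*variants[])(void) = { variant_a, variant_b };
    int best = 0;
    double best_time = 1e30;

    for (int v = 0; v < 2; v++) {
        clock_t t0 = clock();
        variants[v]();                                   /* run the candidate */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (sec < best_time) { best_time = sec; best = v; }
    }
    printf("selected variant %d (%.6f s)\n", best, best_time);
    return 0;
}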
Showing results 1 — 15 out of 170 results