
Automatic compiler techniques for thread coarsening for multithreaded architectures

Gary M. Zoppetti, Gagan Agrawal, Lori Pollock, Jose Nelson Amaral, Xinan Tang, Guang Gao
2000 Proceedings of the 14th international conference on Supercomputing - ICS '00  
Thread partitioning is the most important task in compiling high-level languages for multithreaded architectures.  ...  Our experiments were performed using the EARTH-C compiler and the EARTH multithreaded architecture model emulated on both a cluster of Pentium PCs and a distributed memory multiprocessor.  ...  Acknowledgments We would like to thank Laurie Hendren and the ACAPS group at McGill University for providing us with a copy of the EARTH-C compiler.  ... 
doi:10.1145/335231.335261 dblp:conf/ics/ZoppettiAPATG00 fatcat:k3ny2hxqtbdchjczrnxtynivbu
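
Thread coarsening, the central transformation in this line of work, merges many tiny logical threads into fewer, larger ones so that per-thread creation and scheduling cost is amortized. A minimal C/OpenMP sketch of the idea (not EARTH-C output; COARSEN and the loop body are placeholders):

```c
/* Fine-grained version: every iteration is a separate schedulable unit,
 * so scheduling/synchronization overhead dominates for tiny bodies. */
void scale_fine(double *a, long n, double s) {
    #pragma omp parallel for schedule(static, 1)
    for (long i = 0; i < n; i++)
        a[i] *= s;
}

/* Coarsened version: each scheduled unit covers COARSEN iterations,
 * amortizing the per-chunk overhead over a much larger body of work. */
#define COARSEN 1024   /* hypothetical coarsening factor */
void scale_coarse(double *a, long n, double s) {
    #pragma omp parallel for schedule(static, COARSEN)
    for (long i = 0; i < n; i++)
        a[i] *= s;
}
```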

Parallelization of a dynamic unstructured algorithm using three leading programming paradigms

L. Oliker, R. Biswas
2000 IEEE Transactions on Parallel and Distributed Systems  
version on the newly-released Tera Multithreaded Architecture (MTA).  ...  Our overall results demonstrate that multithreaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.  ...  This multithreaded architecture is especially well-suited for irregular and dynamic problems.  ... 
doi:10.1109/71.879776 fatcat:6gjamrhcmrb7biabyr2yguzwrq

Parallelization of a dynamic unstructured application using three leading paradigms

Leonid Oliker, Rupak Biswas
1999 Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99  
version on the newly-released Tera Multithreaded Architecture (MTA).  ...  Our overall results demonstrate that multithreaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.  ...  This multithreaded architecture is especially well-suited for irregular and dynamic problems.  ... 
doi:10.1145/331532.331571 dblp:conf/sc/OlikerB99 fatcat:3fbdskwh3bb3dnpwm73zag4xlm

Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality [chapter]

Swapneela Unkule, Christopher Shaltz, Apan Qasem
2012 Lecture Notes in Computer Science  
More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level for achieving higher performance.  ...  Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world.  ...  Acknowledgement We would like to thank the reviewers for helping us improve the quality of the final version of this paper. We also thank Dr. Martin Burtscher for allowing us compute time on his GPUs.  ... 
doi:10.1007/978-3-642-28652-0_2 fatcat:nm4nhqvkajf3zg5dqwooe7b2ky
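
The restructuring discussed here is typically thread coarsening for data reuse: one GPU thread is rewritten to handle several neighboring elements so that a value fetched once can serve all of them. A schematic C rendering of the per-thread body before and after (the factor of 4 and the kernel are illustrative, not the tool's actual output):

```c
/* Before coarsening: each logical GPU thread computes one output element
 * and performs its own load of the row coefficient c[row]. */
void thread_body_fine(const float *x, const float *c, float *y,
                      int row, int col, int width) {
    y[row * width + col] = c[row] * x[row * width + col];
}

/* After coarsening by 4: one thread covers four adjacent columns, so the
 * coefficient is loaded once and kept in a register across all four. */
void thread_body_coarse(const float *x, const float *c, float *y,
                        int row, int col0, int width) {
    float coeff = c[row];                       /* single load, reused */
    for (int k = 0; k < 4; k++)
        y[row * width + col0 + k] = coeff * x[row * width + col0 + k];
}
```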

Towards a first vertical prototyping of an extremely fine-grained parallel programming approach

Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin
2001 Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '01  
Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism.  ...  The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler.  ...  However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads.  ... 
doi:10.1145/378580.378597 dblp:conf/spaa/NaishlosNTV01 fatcat:6r7qxrrqtvdvzfwtqebfyrhe4u

Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach

Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin
2003 Theory of Computing Systems  
Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism.  ...  The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler.  ...  However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads.  ... 
doi:10.1007/s00224-003-1086-6 fatcat:ptjyw4sdrjhj3fhtqicxkuowpy

Evaluating the XMT Parallel Programming Model [chapter]

Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin
2001 Lecture Notes in Computer Science  
Despite low thread overhead, thread coarsening is still necessary to some extent, but can usually be automatically applied by the XMT compiler.  ...  Explicit-multithreading (XMT) is a parallel programming model designed for exploiting on-chip parallelism.  ...  The XMT compiler detects such cases, and automatically transforms them such that fewer but longer threads are used.  ... 
doi:10.1007/3-540-45401-2_8 fatcat:lzwe6a3bdfgjjmrj2arn2uakme
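
What the XMT compiler does here can be pictured as replacing one virtual thread per element with a block decomposition over a small number of longer threads. A plain-C sketch of the before/after shapes (XMT spawn/join syntax is omitted; tid and nthreads stand in for the virtual thread id and count):

```c
/* Fine-grained shape: one very short virtual thread per element. */
void vthread_fine(float *a, const float *b, int i) {
    a[i] = b[i] + 1.0f;
}

/* Coarsened shape: each of nthreads longer threads owns a contiguous block,
 * which is what the clustering transformation produces. */
void vthread_coarse(float *a, const float *b, int tid, int nthreads, int n) {
    int grain = (n + nthreads - 1) / nthreads;   /* block size per thread */
    int begin = tid * grain;
    int end   = begin + grain < n ? begin + grain : n;
    for (int i = begin; i < end; i++)
        a[i] = b[i] + 1.0f;
}
```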

Optimization and architecture effects on GPU computing workload performance

John A. Stratton, Nasser Anssari, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Liwen Chang, Geng Daniel Liu, Wen-mei Hwu
2012 2012 Innovative Parallel Computing (InPar)  
Several design principles of GPU architectures have been and will likely continue to be very consistent, such as SIMT and high degrees of multithreading.  ...  We have surveyed many GPU computing applications and kernels and distilled what we believe to be several key optimization techniques and design considerations for high-performance GPU-computing workloads  ...  This is why privatization is an extremely powerful technique for today's CMPs, with a relatively small number of threads, but somewhat limited for the levels of thread parallelism in highly multithreaded  ... 
doi:10.1109/inpar.2012.6339605 fatcat:z7ujhcbv4rdwhfd2hchcw766fy
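
Privatization, named at the end of the snippet, gives each thread a private copy of a contended structure and merges the copies afterwards, removing atomic updates from the hot loop; its cost grows with the thread count, which is the limitation the authors point to for highly multithreaded chips. A small C/OpenMP histogram sketch (the 256-bin histogram is only an example):

```c
#include <string.h>

#define BINS 256

/* Each thread fills a private histogram, then the copies are combined,
 * so no atomics are needed while scanning the data. */
void histogram(const unsigned char *data, long n, unsigned long hist[BINS]) {
    memset(hist, 0, BINS * sizeof hist[0]);
    #pragma omp parallel
    {
        unsigned long priv[BINS] = {0};     /* per-thread private copy */
        #pragma omp for nowait
        for (long i = 0; i < n; i++)
            priv[data[i]]++;
        #pragma omp critical                /* executed once per thread */
        for (int b = 0; b < BINS; b++)
            hist[b] += priv[b];
    }
}
```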

Eliminating synchronization bottlenecks using adaptive replication

Martin C. Rinard, Pedro C. Diniz
2003 ACM Transactions on Programming Languages and Systems  
This article presents a new technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects.  ...  We have implemented adaptive replication in the context of a parallelizing compiler for a subset of C++.  ...  ACKNOWLEDGMENTS We would like to thank the anonymous referees of various versions of this article for their thoughtful and helpful comments.  ... 
doi:10.1145/641909.641911 fatcat:6ftcwn2lbbc3vhv2qb7spqujfm
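
Adaptive replication differs from plain privatization in that a thread gets a replica of an object only when it actually observes contention on that object, and the replicas are merged back later. A hand-rolled approximation in C using OpenMP locks (the paper's compiler inserts equivalent logic automatically; acc_t and the trylock-based contention test are illustrative):

```c
#include <omp.h>

typedef struct {
    double     sum;     /* field updated atomically in the original program */
    omp_lock_t lock;    /* must be set up with omp_init_lock() before use   */
} acc_t;

/* Try the shared object first; if its lock is busy (contention observed),
 * divert the update to this thread's replica instead of waiting. */
void accumulate(acc_t *shared, double *my_replica, double value) {
    if (omp_test_lock(&shared->lock)) {
        shared->sum += value;               /* uncontended: update in place */
        omp_unset_lock(&shared->lock);
    } else {
        *my_replica += value;               /* contended: update the replica */
    }
}

/* After the parallel phase, fold the per-thread replicas back in. */
void merge_replicas(acc_t *shared, const double *replicas, int nthreads) {
    for (int t = 0; t < nthreads; t++)
        shared->sum += replicas[t];
}
```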

Analysis of Task Offloading for Accelerators [chapter]

Roger Ferrer, Vicenç Beltran, Marc Gonzàlez, Xavier Martorell, Eduard Ayguadé
2010 Lecture Notes in Computer Science  
Overall, our performance is better compared to the results obtained from the IBM compiler for the Cell processor.  ...  for programmers to offload parts of their applications to the auxiliary processors.  ...  Acknowledgements We would like to thank the Barcelona Supercomputing Center (BSC) for the use of their machines.  ... 
doi:10.1007/978-3-642-11515-8_24 fatcat:d3kcmhhhznakfeivzung7346r4

Mapping and optimization of the AVS video decoder on a high performance chip multiprocessor

Konstantinos Krommydas, George Tsoublekas, Christos D. Antonopoulos, Nikolaos Bellas
2010 2010 IEEE International Conference on Multimedia and Expo  
The input-dependent variability of execution time per work chunk is addressed using dynamic scheduling to allocate work to each thread.  ...  This paper presents the implementation, optimization and characterization of the AVS video decoder on Intel Core i7, a quad-core, hyper-threaded, chip multiprocessor (CMP).  ...  threads), multithreaded version (4 threads) and final, vectorized, multithreaded code (4 threads).  ... 
doi:10.1109/icme.2010.5582558 dblp:conf/icmcs/KrommydasTAB10 fatcat:36fjeahcfvayngwne4id6t5orm
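
The load-balancing device mentioned first in the snippet is ordinary dynamic scheduling: idle threads pull the next unprocessed work chunk instead of being handed a fixed share up front, which absorbs the input-dependent variation in per-chunk decode time. In C/OpenMP this is a one-line scheduling clause (decode_macroblock_row is a hypothetical stand-in for the decoder's work unit):

```c
void decode_macroblock_row(int row);   /* hypothetical; cost varies with content */

void decode_frame(int nrows) {
    /* schedule(dynamic, 1): a thread grabs the next row as soon as it
     * finishes its current one, so expensive rows do not stall the rest. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int row = 0; row < nrows; row++)
        decode_macroblock_row(row);
}
```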

Scheduling threads for constructive cache sharing on CMPs

Shimin Chen, Todd C. Mowry, Chris Wilkerson, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set.  ...  In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive  ...  Using Profiling Information for Automatic Task Coarsening The automatic task coarsening algorithm traverses the task group tree from top to bottom and evaluates a heuristic stop criterion at every node  ... 
doi:10.1145/1248377.1248396 dblp:conf/spaa/ChenGKLABFFHMW07 fatcat:7zuvfmkmorbzzdwlmkdl5pmwa4
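
The automatic task coarsening sketched in the snippet walks the task tree from the root and stops creating new tasks once a heuristic says a subtree is small enough to run sequentially. Hand-written, this is the familiar cutoff in recursive task parallelism; a C/OpenMP sketch (CUTOFF is a stand-in for the profile-driven stop criterion):

```c
#define CUTOFF 4096   /* stand-in for the profile-driven stop criterion */

/* Recursive array sum; call from inside '#pragma omp parallel' followed by
 * '#pragma omp single'. Subtrees below CUTOFF are coarsened into plain
 * sequential loops instead of new tasks. */
double tree_sum(const double *a, long lo, long hi) {
    if (hi - lo <= CUTOFF) {
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;
    double left, right;
    #pragma omp task shared(left)
    left = tree_sum(a, lo, mid);          /* spawned as a child task */
    right = tree_sum(a, mid, hi);         /* computed by the current task */
    #pragma omp taskwait
    return left + right;
}
```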

Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

Leonid Oliker, Xiaoye Li, Parry Husbands, Rupak Biswas
2002 SIAM Review  
Thanks also to Bruce Hendrickson and the other anonymous referees for their suggestions that helped improve the paper.  ...  promising strategy for hybrid architectures.  ...  ity for PCG due to a lack of thread-level parallelism.  ... 
doi:10.1137/s00361445003820 fatcat:o7uwxsbcfnf4ppcfc2sbpxncmm

Dynamic tiling for effective use of shared caches on multithreaded processors

Dimitrios S. Nikolopoulos
2004 International Journal of High Performance Computing and Networking  
The key idea is to use two tile sizes in the program, one for single-threaded execution mode and one suitable for multithreaded execution mode and switch between tile sizes at runtime.  ...  Simultaneous multithreaded (SMT) processors use data caches which are dynamically shared between threads.  ...  Acknowledgement The authors would like to thank the IJHPCN referees for several helpful suggestions. A preliminary version of this work has been published (Nikolopoulos, 2003) .  ... 
doi:10.1504/ijhpcn.2004.009265 fatcat:as5smmoulnaavnchbcokabjbmu
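
The mechanism is concrete: keep two tile sizes, one sized for the whole cache and one for this thread's share of it when the SMT sibling is busy, and choose between them each time the blocked loop nest is entered. A C sketch (the tile sizes and the co_runner_active query are placeholders; the paper derives them from the cache configuration and runtime state):

```c
#include <stdbool.h>

#define TILE_SINGLE 2048   /* tile sized for the whole shared cache     */
#define TILE_MULTI  1024   /* tile sized for this thread's share of it  */

bool co_runner_active(void);   /* hypothetical runtime query */

/* y = A*x, blocked over columns so a tile of x stays cache-resident while
 * all rows are swept; the tile size adapts to the current sharing mode. */
void matvec_tiled(const double *A, const double *x, double *y, long n) {
    long tile = co_runner_active() ? TILE_MULTI : TILE_SINGLE;
    for (long i = 0; i < n; i++)
        y[i] = 0.0;
    for (long jj = 0; jj < n; jj += tile)
        for (long i = 0; i < n; i++)
            for (long j = jj; j < jj + tile && j < n; j++)
                y[i] += A[i * n + j] * x[j];
}
```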

Compiler generation and autotuning of communication-avoiding operators for geometric multigrid

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker
2013 20th Annual International Conference on High Performance Computing  
identify the best implementation for a particular architecture and at each computation phase.  ...  To make the approach portable, an underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically  ...  As modern architectures continue to grow in core count and exhibit a hierarchy of complex inter-thread and inter-process interactions, new communication-avoiding techniques have been introduced for GMG  ... 
doi:10.1109/hipc.2013.6799131 dblp:conf/hipc/BasuVHWSO13 fatcat:e7jdczjskbdh7buasco2fhrm7m
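
The autotuning layer reduces to a simple discipline: generate a family of variants (different ghost-zone depths, threading schemes, and so on), time each on the target machine, and keep the fastest. A bare-bones driver in C/OpenMP (smooth_variant and the parameter arrays are hypothetical placeholders for the generated code variants):

```c
#include <omp.h>
#include <float.h>
#include <stdio.h>

/* Hypothetical candidate: deeper ghost zones mean more redundant
 * computation but fewer communication/synchronization steps. */
void smooth_variant(int ghost_depth, int nthreads);

int autotune(const int depths[], const int threads[], int nvariants) {
    int best = 0;
    double best_time = DBL_MAX;
    for (int v = 0; v < nvariants; v++) {
        double t0 = omp_get_wtime();
        smooth_variant(depths[v], threads[v]);   /* run the candidate */
        double t  = omp_get_wtime() - t0;
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("best variant: %d (%.3f s)\n", best, best_time);
    return best;
}
```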
Showing results 1 — 15 out of 170 results