Shared Memory versus Message Passing for Iterative Solution of Sparse, Irregular Problems

Frederic T. Chong, Anant Agarwal
1999 Parallel Processing Letters  
The benefits of hardware support for shared memory versus those for message passing are difficult to evaluate without an in-depth study of real applications on a common platform.  ...  We find that machines with fast global memory operations do not need message passing or bulk transfer to support our irregular problems, primarily for three reasons.  ...  Yeung and Agarwal [YA93] explored fine-grain synchronization and language support for preconditioned conjugate gradient on regular problems on Alewife.  ... 
doi:10.1142/s0129626499000177 fatcat:ja24n7c6w5ghjk3jfiq3dqmzgq
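
The kernel class behind this study is worth keeping concrete. A minimal sketch of a compressed-sparse-row (CSR) matrix-vector product, the inner loop of most iterative solvers, shows the indirect indexing that makes communication irregular; this is a generic illustration, not code from the paper:

    def spmv_csr(row_ptr, col, val, x):
        """y = A @ x with A stored in compressed sparse row (CSR) form."""
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(y)):
            for j in range(row_ptr[i], row_ptr[i + 1]):
                # The gather x[col[j]] is data-dependent: on a parallel
                # machine it becomes irregular remote reads or messages.
                y[i] += val[j] * x[col[j]]
        return y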

A Survey on Hardware and Software Support for Thread Level Parallelism [article]

Somnath Mazumdar, Roberto Giorgi
2016 arXiv   pre-print
We also discuss software support for threads, mainly to increase deterministic behavior at runtime.  ...  Hardware support at execution time is crucial to system performance; thus different types of hardware support for threads exist or have been proposed, primarily based on widely used  ...  Some of the PowerPC-based processor models support coarse-grain as well as fine-grain multithreading.  ... 
arXiv:1603.09274v3 fatcat:75isdvgp5zbhplocook6273sq4

A Survey: Runtime Software Systems for High Performance Computing

2017 Supercomputing Frontiers and Innovations  
Many share common properties such as multi-tasking (either preemptive or non-preemptive), message-driven computation such as active messages, sophisticated fine-grain synchronization such as dataflow and  ...  These methods are principally coarse grained and statically scheduled.  ...  Many applications are now far more sophisticated than this, combining irregular and time-varying data structures with medium- to fine-grained tasks to expose an abundance of parallelism for greater scalability  ... 
doi:10.14529/jsfi170103 fatcat:yqj65kpvhngovcmgrr46vwwr6i

Executing Optimized Irregular Applications Using Task Graphs within Existing Parallel Models

Christopher D. Krieger, Michelle Mills Strout, Jonathan Roelofs, Amanreet Bajwa
2012 2012 SC Companion: High Performance Computing, Networking Storage and Analysis  
Many sparse or irregular scientific computations are memory bound and benefit from locality-improving optimizations such as blocking or tiling.  ...  We present performance and scalability results for 8- and 40-core shared memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.  ...  This project is supported by the CSCAPES Institute, which is supported by the U.S.  ... 
doi:10.1109/sc.companion.2012.43 dblp:conf/sc/KriegerSRB12 fatcat:6yfyqol2knajpix5fj5k4pegoi
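
To make the execution model concrete, here is a minimal, hypothetical sketch (our names, not the paper's API) of running a tile-level task graph: each task fires once its dependences complete, the pattern the authors map onto existing parallel models. A real runtime would execute the ready set in parallel; the loop below is a serial stand-in:

    from collections import deque

    def run_task_graph(tasks, deps):
        """tasks: {id: callable}; deps: {id: set of ids it waits on},
        with an entry (possibly empty) for every task id."""
        remaining = {t: set(deps[t]) for t in tasks}
        dependents = {t: [] for t in tasks}
        for t in tasks:
            for u in deps[t]:
                dependents[u].append(t)
        ready = deque(t for t in tasks if not remaining[t])
        while ready:
            t = ready.popleft()
            tasks[t]()                      # run this tile's computation
            for v in dependents[t]:         # release tasks waiting on t
                remaining[v].discard(t)
                if not remaining[v]:
                    ready.append(v)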

LoGPC

Csaba Andras Moritz, Matthew I. Frank
1998 Performance Evaluation Review  
Abstract: In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources  ...  We validate LoGPC by analyzing three applications implemented with Active Messages [11], [19] on the MIT Alewife multiprocessor.  ...  This also gives us an indication of the performance improvements obtainable with improved mapping or communication locality for very fine-grained applications with large messages.  ... 
doi:10.1145/277858.277933 fatcat:ddcgz6e7wrfdtcddsmao53lfzq
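
As background for the model's name: LoGPC starts from LogP/LogGP and adds a contention term. A back-of-envelope helper under the standard LogGP cost (parameter names from the LogP/LogGP literature; the contention value C is exactly what LoGPC derives analytically, treated here as a given input):

    def message_time(k, L, o, G, C=0.0):
        """LogGP time for one k-byte message plus a contention term C:
        send overhead o, per-byte gap G, wire latency L, receive
        overhead o. Modeling C is LoGPC's contribution."""
        return o + (k - 1) * G + L + o + C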

Scheduling threads for low space requirement and good locality

Girija J. Narlikar
1999 Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures - SPAA '99  
For a nested-parallel program with depth D and serial space requirement S1, the expected space requirement is S1 + O(K * p * D) on p processors, where K is a user-adjustable runtime parameter, which provides a trade-off between running time and space requirement.  ...  At a fine thread granularity, our scheduler outperforms both these previous schedulers, but requires marginally more memory than the depth-first scheduler.  ...  We also thank Adam Kalai and Avrim Blum for useful discussions.  ... 
doi:10.1145/305619.305629 dblp:conf/spaa/Narlikar99 fatcat:bmulxu7zm5bsbknvr2637l4gwm
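
A worked instance of the bound above, with hypothetical numbers (the constants hidden inside the O() are ignored): raising the memory threshold K reduces scheduling overhead but inflates the extra-space term linearly.

    S1 = 100 * 2**20        # serial space requirement: 100 MiB
    K, p, D = 4096, 8, 50   # memory threshold (bytes), processors, depth
    extra = K * p * D       # the O(K*p*D) term, constants dropped
    print(f"expected space <= S1 + c*{extra} bytes for some constant c")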

Scheduling Threads for Low Space Requirement and Good Locality

G. J. Narlikar
2002 Theory of Computing Systems  
For a nested-parallel program with depth D and serial space requirement S1, the expected space requirement is S1 + O(K * p * D) on p processors, where K is a user-adjustable runtime parameter, which provides a trade-off between running time and space requirement.  ...  At a fine thread granularity, our scheduler outperforms both these previous schedulers, but requires marginally more memory than the depth-first scheduler.  ...  We also thank Adam Kalai and Avrim Blum for useful discussions.  ... 
doi:10.1007/s00224-001-1030-6 fatcat:sy3gyvncnnf2veeve6qparpl2q

LoGPC

Csaba Andras Moritz, Matthew I. Frank
1998 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems - SIGMETRICS '98/PERFORMANCE '98  
Abstract: In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources  ...  We validate LoGPC by analyzing three applications implemented with Active Messages [11], [19] on the MIT Alewife multiprocessor.  ...  This also gives us an indication of the performance improvements obtainable with improved mapping or communication locality for very fine-grained applications with large messages.  ... 
doi:10.1145/277851.277933 dblp:conf/sigmetrics/MoritzF98 fatcat:sos664me45aenhnkgiuyssameq

LoGPC: Modeling network contention in message-passing programs

C.A. Moritz, M.I. Frank
2001 IEEE Transactions on Parallel and Distributed Systems  
Abstract: In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources  ...  We validate LoGPC by analyzing three applications implemented with Active Messages [11], [19] on the MIT Alewife multiprocessor.  ...  This also gives us an indication of the performance improvements obtainable with improved mapping or communication locality for very fine-grained applications with large messages.  ... 
doi:10.1109/71.920589 fatcat:h3arcgip7rhs5n2hxy3l3ngieq

Scheduling threads for constructive cache sharing on CMPs

Shimin Chen, Todd C. Mowry, Chris Wilkerson, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive  ...  In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance.  ...  all the fine-grained parallel programs studied.  ... 
doi:10.1145/1248377.1248396 dblp:conf/spaa/ChenGKLABFFHMW07 fatcat:7zuvfmkmorbzzdwlmkdl5pmwa4
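
A minimal sketch of the Parallel Depth First idea, assuming tasks carry their rank in the sequential (depth-first) execution order; the class and method names are ours. Idle cores always take the earliest-ranked ready task, so co-scheduled tasks touch data that is close together in the sequential order and can constructively share the cache, whereas work stealing executes the newest local task and steals the oldest remote one:

    import heapq

    class PDFScheduler:
        """Ready queue ordered by sequential (depth-first) rank."""
        def __init__(self):
            self._ready = []

        def add(self, seq_rank, task):
            heapq.heappush(self._ready, (seq_rank, task))

        def next_task(self):
            # Every idle core pulls the globally earliest ready task.
            return heapq.heappop(self._ready)[1] if self._ready else None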

The Paradigm compiler for distributed-memory multicomputers

P. Banerjee, J.A. Chandy, M. Gupta, E.W. Hodges, J.G. Holm, A. Lain, D.J. Palermo, S. Ramaswamy, E. Su
1995 Computer  
A unified approach efficiently supports regular and irregular computations using data and functional parallelism.  ...  The Paradigm (Parallelizing Compiler for Distributed-Memory, General-Purpose Multicomputers) project at the University of Illinois addresses this problem by developing automatic methods for efficient parallelization  ...  We are also grateful to the National Center for Supercomputing Applications, the San Diego Supercomputing Center, and the Argonne National Laboratory for providing access to their machines.  ... 
doi:10.1109/2.467577 fatcat:ghmtervcfzehzlelvf2ealwgyu

Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, Lidong Zhou
2020 USENIX Symposium on Operating Systems Design and Implementation  
We implement RAMMER for multiple hardware backends such as NVIDIA GPUs, AMD GPUs, and Graphcore IPU.  ...  RAMMER generates an efficient static spatio-temporal schedule for a DNN at compile time to minimize scheduling overhead.  ...  Jinyang Li, for their extensive suggestions. We thank Jim Jernigan and Kendall Martin from the Microsoft Grand Central Resources team for the support of GPUs.  ... 
dblp:conf/osdi/MaXYXMCHYZZ20 fatcat:5f246j7p3fdqvof5e2w2ubi7iu
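
A hypothetical sketch of what "static spatio-temporal schedule" means operationally (our names and a greedy placement policy, not RAMMER's actual algorithm): each fine-grained task is pinned to an execution unit (space) and a start step (time) entirely at compile time, so no scheduler runs on the device:

    def build_schedule(rtasks, num_eus):
        """rtasks: list of (name, duration); returns (eu, start, name)."""
        free_at = [0] * num_eus           # next free time step per EU
        schedule = []
        for name, duration in rtasks:
            eu = min(range(num_eus), key=free_at.__getitem__)
            schedule.append((eu, free_at[eu], name))
            free_at[eu] += duration       # occupy the EU for this task
        return schedule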

CASCH: a tool for computer-aided scheduling

I. Ahmad, Yu-Kwong Kwok, Min-You Wu, Wei Shu
2000 IEEE Concurrency  
A partial taxonomy of the multiprocessor-scheduling problem.  ...  ACKNOWLEDGMENTS We thank the referees for their constructive and insightful comments that have greatly improved the presentation of this article. We  ...  Her current interests include dynamic scheduling, resource management, runtime support systems for parallel and distributed processing, multimedia networking, and operating system support for large-scale  ... 
doi:10.1109/4434.895101 fatcat:ccjjnig47bfivkelhqxa5wbjb4

Palirria: accurate on-line parallelism estimation for adaptive work-stealing

Georgios Varisteas, Mats Brorsson
2015 Concurrency and Computation  
We implemented Palirria for both the Linux and Barrelfish operating systems and evaluated it on two platforms: a 48-core NUMA multiprocessor and a simulated 32-core system.  ...  The estimation mechanism is optimized for accuracy, minimizing the requested resources without degrading performance.  ...  Their parallelism profiles range from the fine-grained nQueens, with a wide and balanced tree, to the quite irregular and coarser-grained Strassen.  ... 
doi:10.1002/cpe.3630 fatcat:rct6gsyqrjcfpekgjz6wpqxsou
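
For intuition only (and explicitly not Palirria's actual estimator, which the paper optimizes for accuracy), a crude adaptive policy might watch steal outcomes: mostly successful steals signal surplus parallelism, mostly failed steals signal too many workers:

    def adapt_workers(current, steal_attempts, steal_successes,
                      lo=0.2, hi=0.8, max_workers=48):
        rate = steal_successes / max(steal_attempts, 1)
        if rate > hi and current < max_workers:
            return current + 1    # plenty of stealable work: add a core
        if rate < lo and current > 1:
            return current - 1    # steals mostly fail: release a core
        return current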

A Task-Centric Memory Model for Scalable Accelerator Architectures

John H. Kelm, Daniel R. Johnson, Steven S. Lumetta, Matthew I. Frank, Sanjay J. Patel
2009 2009 18th International Conference on Parallel Architectures and Compilation Techniques  
...  hardware coherence support.  ...  This paper presents a task-centric memory model for 1000-core compute accelerators.  ...  Johnson, Aqeel Mahesri, and the anonymous referees for their input and feedback. John Kelm was partially supported by a fellowship from ATI/AMD.  ... 
doi:10.1109/pact.2009.16 dblp:conf/IEEEpact/KelmJLFP09 fatcat:6jxpwblpprhzrcqf7xmirztkti
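
A minimal sketch of the task-centric contract, using our own names rather than the paper's API: data movement is tied to task boundaries (inputs pulled at task start, outputs published at task end), which is what lets an accelerator forgo per-access hardware coherence:

    class TaskMemory:
        def __init__(self, global_mem):
            self.global_mem = global_mem   # shared backing store
            self.local = {}                # software-managed private copy

        def begin_task(self, inputs):
            for addr in inputs:            # pull fresh copies of inputs
                self.local[addr] = self.global_mem[addr]

        def end_task(self, outputs):
            for addr in outputs:           # publish outputs, drop the rest
                self.global_mem[addr] = self.local[addr]
            self.local.clear()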
Showing results 1 — 15 out of 100