Reducing the burden of parallel loop schedulers for many‐core processors

Mahwish Arif, Hans Vandierendonck
2021 Concurrency and Computation  
This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine-grain loops.  ...  Compiler support enables efficient reductions for Cilk, without changing the programming interface of Cilk reducers.  ...  The speedup grows with increasing thread count, indicating that future many-core processors will be even more susceptible to scheduler burden.  ... 
doi:10.1002/cpe.6241 fatcat:4rluruunxjb4dehant4kjl354e

Reducing the burden of parallel loop schedulers for many‐core processors

Mahwish Arif, Hans Vandierendonck, Apollo-University Of Cambridge Repository
2021
This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine‐grain loops.  ...  Compiler support enables efficient reductions for Cilk, without changing the programming interface of Cilk reducers.  ...  The speedup grows with increasing thread count, indicating that future many-core processors will be even more susceptible to scheduler burden.  ... 
doi:10.17863/cam.71347 fatcat:m4y6hdgbfff2nfdvmkophzusui

The Cilkview scalability analyzer

Yuxiong He, Charles E. Leiserson, William M. Leiserson
2010 Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures - SPAA '10  
In addition, Cilkview analyzes scheduling overhead using the concept of a "burdened dag," which allows it to diagnose performance problems in the application due to an insufficient grain size of parallel  ...  These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of processing cores.  ...  The Cilk++ parallel memcpy replaces the for loop of the serial implementation with a cilk_for loop to enable parallelism.  ... 
doi:10.1145/1810479.1810509 dblp:conf/spaa/HeLL10 fatcat:mspvwpghfnahba3hfar5vidxqq

Research on the construction and simulation of PO-Dijkstra algorithm model in parallel network of multicore platform

Bo Zhang, De Ji Hu
2020 EURASIP Journal on Wireless Communications and Networking  
The development of multicore hardware has provided many new development opportunities for many application software algorithms.  ...  Using "divide by data" will reduce the cost and management difficulty of real-time temperature. Using "divide by function" is a good choice for streaming media data.  ...  Acknowledgements No Authors' contributions Bo Zhang is responsible for the experimental part of the article, and DeJi Hu is responsible for the theoretical part of the article.  ... 
doi:10.1186/s13638-020-01680-x fatcat:6ntzysupyjcptdgudj2poaewfm

Implementing communications systems on an SDR SoC

John Glossner, Daniel Iancu, Mayan Moudgill, Sanjay Jinturkar, Gary Nacer, Stuart Stanley, Andrei Iancu, Hua Ye, Michael Schulte, Mihai Sima, Tomas Palenik, Peter Farkas (+1 others)
2008 Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing  
In this paper we present techniques for implementing communications systems in software. We describe briefly the SB3011 platform and programming environment.  ...  Software Defined Radios (SDRs) offer a programmable and dynamically reconfigurable method of reusing hardware to implement the physical layer processing of multiple communications systems.  ...  To enable physical layer processing in software, processors should support many levels of parallelism.  ... 
doi:10.1109/icassp.2008.4518876 dblp:conf/icassp/GlossnerIMJNSIYSSPFT08 fatcat:lsvfngv4dfaudcskavuffu25zu

Lazy binary-splitting

Alexandros Tzannes, George C. Caragea, Rajeev Barua, Uzi Vishkin
2010 Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '10  
Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code.  ...  This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism.  ...  Motivation for Dynamic Scheduling Static scheduling of doall loops is easy: the number of iterations can be divided by the number of processors at run-time to yield how many iterations each processor should  ... 
doi:10.1145/1693453.1693479 dblp:conf/ppopp/TzannesCBV10 fatcat:j3x6vvurtrhvzj53253riur3ee

Lazy binary-splitting

Alexandros Tzannes, George C. Caragea, Rajeev Barua, Uzi Vishkin
2010 SIGPLAN notices  
Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code.  ...  This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism.  ...  Motivation for Dynamic Scheduling Static scheduling of doall loops is easy: the number of iterations can be divided by the number of processors at run-time to yield how many iterations each processor should  ... 
doi:10.1145/1837853.1693479 fatcat:sm26nqo3ifhonndv6veqtoijhi

Machine learning based online performance prediction for runtime parallelization and task scheduling

Jiangtian Li, Xiaosong Ma, Karan Singh, Martin Schulz, Bronis R. de Supinski, Sally A. McKee
2009 2009 IEEE International Symposium on Performance Analysis of Systems and Software  
With the emerging many-core paradigm, parallel programming must extend beyond its traditional realm of scientific applications.  ...  However, many systems lack a priori knowledge about the execution time of all tasks to perform effective load balancing with low scheduling overhead.  ...  We are thankful to Hao (Helen) Zhang and Mihye Ahn from Department of Statistics at NCSU for providing the LMM application for our evaluation.  ... 
doi:10.1109/ispass.2009.4919641 dblp:conf/ispass/LiMSSSM09 fatcat:egyj4zpw2bas3guozkm3jtufqa

Exploiting inter-thread temporal locality for chip multithreading

Jiayuan Meng, Jeremy W Sheaffer, Kevin Skadron
2010 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed  ...  This paper proposes the symbiotic affinity scheduling (SAS) algorithm in which work is first partitioned according to the number of cores (i.e., the number of caches), and these partitions are then subdivided  ...  We would like to thank Shuai Che and Jiawei Huang who helped us on the coding of HotSpot and LU for benchmarking, Jie Li who modeled the hardware schedulers in FPGA, and Michael Boyer and Mario Donato  ... 
doi:10.1109/ipdps.2010.5470465 dblp:conf/ipps/MengSS10 fatcat:6b33ba2lmzcnzjo24mlogswgza

Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model

Minjang Kim, Pranith Kumar, Hyesoon Kim, Bevin Brett
2012 2012 IEEE 26th International Parallel and Distributed Processing Symposium  
Parallel Prophet models many realistic features of parallel programs: unbalanced workload, multiple critical sections, nested and recursive parallelism, and specific thread schedulings and paradigms, which  ...  With Parallel Prophet, programmers simply insert annotations that describe the parallel behavior of the serial program.  ...  Each benchmark is estimated by (1) the synthesizer without the memory model ('Pred'), (2) the synthesizer with the memory model ('PredM'), and (3) Suitability ('Suit').  ... 
doi:10.1109/ipdps.2012.128 dblp:conf/ipps/KimKKB12 fatcat:hcolajzgqfayzduw4nnb74fkt4

Runtime Aware Architectures

Mateo Valero Cortes
2018 Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation - SIGSIM-PADS '18  
The runtime of the parallel application has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have.  ...  ) in superscalar processors.  ...  Acknowledgments This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, the HiPEAC Network of Excellence, and by the European Research Council  ... 
doi:10.1145/3200921.3204479 dblp:conf/pads/Cortes18 fatcat:ctgvsceil5cgxpba7hhoy5f3ae

Exploiting Both Pipelining and Data Parallelism with SIMD Reconfigurable Architecture [chapter]

Yongjoo Kim, Jongeun Lee, Jinyong Lee, Toan X. Mai, Ingoo Heo, Yunheung Paek
2012 Lecture Notes in Computer Science  
number of cores.  ...  We further present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of iteration and data mapping.  ...  Also for large loops with many operations in the loop body, our small core might not be a good match.  ... 
doi:10.1007/978-3-642-28365-9_4 fatcat:n2mauiwx65a2vic6wripfqjkm4

Scheduling task parallelism on multi-socket multicore systems

Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, Jan F. Prins
2011 Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers - ROSS '11  
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on  ...  For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks.  ...  Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04  ... 
doi:10.1145/1988796.1988804 fatcat:r7fcxjxulbe7pm66zacsdn2gam

The Cilk++ concurrency platform

Charles E. Leiserson
2009 Proceedings of the 46th Annual Design Automation Conference - DAC '09  
The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources.  ...  The Cilk++ runtime system guarantees to load-balance computations effectively.  ...  Thanks to Patrick Madden of SUNY Binghamton for proposing extensive revisions to the original manuscript.  ... 
doi:10.1145/1629911.1630048 dblp:conf/dac/Leiserson09 fatcat:5oenlyp7gvfidgh2snrrik7vdi

Multicore compilation strategies and challenges

Mojtaba Mehrara, Thomas Jablin, Dan Upton, David August, Kim Hazelwood, Scott Mahlke
2009 IEEE Signal Processing Magazine  
This article provides an overview of parallelism and compiler technology to help the community understand the software development challenges and opportunities for multicore signal processors.  ...  The burden is placed on software developers and tools to find and exploit coarse-grain parallelism to effectively make use of the abundance of computing resources provided by these systems.  ...  Many new languages have been proposed to ease the burden of writing parallel programs, including Atomos, Cilk, and StreamIt.  ... 
doi:10.1109/msp.2009.934117 fatcat:xacwdf6mljdvnafkb3m5kfjlfu