62 Hits in 6.0 sec

Parallel depth first vs. work stealing schedulers on CMP architectures

Vasileios Liaskovitis, Todd C. Mowry, Chris Wilkerson, Shimin Chen, Phillip B. Gibbons, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Michael Kozuch
2006 Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures - SPAA '06  
In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is  ...  Figure 1: PDF vs. WS for parallel merge sort  ... 
doi:10.1145/1148109.1148167 dblp:conf/spaa/LiaskovitisCGABFFHKMW06 fatcat:prva2z5usfeappjbmspw6lcqg4

Scheduling threads for constructive cache sharing on CMPs

Shimin Chen, Todd C. Mowry, Chris Wilkerson, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive  ...  Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also  ...  Figure 2: Parallel Depth First vs. Work Stealing with default CMP configurations. Figure 3: Parallel Depth First vs. Work Stealing under a single technology (45nm).  ... 
doi:10.1145/1248377.1248396 dblp:conf/spaa/ChenGKLABFFHMW07 fatcat:7zuvfmkmorbzzdwlmkdl5pmwa4
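The contrast drawn in the two entries above comes down to how ready tasks are ordered. A rough single-threaded sketch of the two disciplines follows (Task, WSWorker, and PDFPool are illustrative names, and all synchronization is omitted): work stealing keeps one deque per worker, with the owner pushing and popping at the back and idle workers stealing from the front, while a Parallel Depth First pool always hands out the ready tasks that occur earliest in the program's sequential depth-first execution order.

```cpp
// Sketch of the two ready-task disciplines compared above (WS vs. PDF).
// Hypothetical Task type; single-threaded illustration of dispatch order only.
#include <cstdint>
#include <deque>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    uint64_t seq_rank;            // position in the 1-processor depth-first order
    std::function<void()> body;   // the work itself
};

// Work stealing: one deque per worker. The owner pushes and pops at the back
// (LIFO, good locality); an idle worker steals the oldest task from the front.
struct WSWorker {
    std::deque<Task> tasks;
    void push(Task t) { tasks.push_back(std::move(t)); }
    bool pop_local(Task& t) {                 // owner path
        if (tasks.empty()) return false;
        t = std::move(tasks.back()); tasks.pop_back(); return true;
    }
    bool steal(Task& t) {                     // thief path
        if (tasks.empty()) return false;
        t = std::move(tasks.front()); tasks.pop_front(); return true;
    }
};

// Parallel Depth First: a shared pool that always releases the ready tasks
// occurring earliest in the sequential depth-first execution order, so the
// parallel execution stays close to the sequential one and to its footprint
// on a shared cache.
struct PDFPool {
    struct ByRank {
        bool operator()(const Task& a, const Task& b) const {
            return a.seq_rank > b.seq_rank;   // min-heap on sequential rank
        }
    };
    std::priority_queue<Task, std::vector<Task>, ByRank> ready;
    void push(Task t) { ready.push(std::move(t)); }
    bool pop(Task& t) {
        if (ready.empty()) return false;
        t = ready.top(); ready.pop(); return true;
    }
};
```

The priority used here is the task's rank in the one-processor depth-first execution, which is what lets a PDF scheduler keep the parallel working set close to the sequential one; the deques give WS its low-overhead, mostly-local operation.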

Carbon

Sanjeev Kumar, Christopher J. Hughes, Anthony Nguyen
2007 SIGARCH Computer Architecture News  
Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, 68% faster on a set of loop-parallel benchmarks, and 109% faster on a set of task-parallel  ...  Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily.  ...  Typically, a LIFO, or depth-first, order has better cache locality (and smaller working set) while a FIFO, or breadth-first, order exposes more parallelism. • Simplify multithreading: A simple task queuing  ... 
doi:10.1145/1273440.1250683 fatcat:vy3llevlmrhyrl64v5pxhie2vu
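The LIFO-vs-FIFO trade-off stated in the snippet above is easy to see with a toy experiment (a hedged sketch, not Carbon's hardware task queues): expand the same binary task tree from a single deque, taking tasks either from the back (LIFO, depth-first) or from the front (FIFO, breadth-first), and record the peak number of pending tasks.

```cpp
// Toy illustration of the LIFO vs. FIFO ordering trade-off: depth-first
// expansion keeps only O(depth) tasks pending, while breadth-first expansion
// lets the pending set grow with the width of the tree.
#include <cstddef>
#include <cstdio>
#include <deque>
#include <initializer_list>

struct Node { int depth; };

// Expand a binary task tree of the given height, taking tasks from one end of
// a deque: back = LIFO (depth-first), front = FIFO (breadth-first).
static std::size_t max_pending(int height, bool lifo) {
    std::deque<Node> pending;
    pending.push_back({0});
    std::size_t peak = 1;
    while (!pending.empty()) {
        Node n = lifo ? pending.back() : pending.front();
        if (lifo) pending.pop_back(); else pending.pop_front();
        if (n.depth < height) {                     // "spawn" two child tasks
            pending.push_back({n.depth + 1});
            pending.push_back({n.depth + 1});
        }
        if (pending.size() > peak) peak = pending.size();
    }
    return peak;
}

int main() {
    for (int h : {4, 8, 12, 16}) {
        std::printf("height %2d: LIFO peak %6zu, FIFO peak %6zu\n",
                    h, max_pending(h, true), max_pending(h, false));
    }
    return 0;
}
```

LIFO keeps the pending set proportional to the tree depth, whereas FIFO lets it grow to the width of the tree: the "smaller working set" versus "exposes more parallelism" point made in the abstract.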

Carbon

Sanjeev Kumar, Christopher J. Hughes, Anthony Nguyen
2007 Proceedings of the 34th annual international symposium on Computer architecture - ISCA '07  
Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, 68% faster on a set of loop-parallel benchmarks, and 109% faster on a set of task-parallel  ...  Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily.  ...  Typically, a LIFO, or depth-first, order has better cache locality (and smaller working set) while a FIFO, or breadth-first, order exposes more parallelism. • Simplify multithreading: A simple task queuing  ... 
doi:10.1145/1250662.1250683 dblp:conf/isca/KumarHN07 fatcat:6c4xq3g6dfczjid7r6aqxw24ze

Synchronization Using Remote-Scope Promotion

Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, David A. Wood
2015 Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '15  
Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average.  ...  It works poorly for dynamic sharing patterns (e.g., work stealing) where programmers cannot use a faster small scope due to the rare possibility that the work is stolen by a thread in a distant slower  ...  Steal-only: The third algorithm, called steal-only, improves on baseline by replacing its static scheduling algorithm with work stealing.  ... 
doi:10.1145/2694344.2694350 dblp:conf/asplos/OrrCYBHW15 fatcat:aq4uu7aa6revzft5i3nbnmqdoq
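The "steal-only" idea quoted above, replacing a static schedule with work stealing, can be sketched in ordinary CPU terms. This is a generic, mutex-based illustration, not the paper's GPU implementation with scoped synchronization, and all names are made up: each worker starts with a static slice of the iteration space and, once its slice is exhausted, steals half of another worker's remaining slice.

```cpp
// Generic sketch: static partitioning plus range stealing when a worker runs dry.
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Range {
    std::mutex m;
    long begin = 0, end = 0;                  // [begin, end) still to be processed

    bool take_one(long& i) {                  // owner path: grab the next index
        std::lock_guard<std::mutex> g(m);
        if (begin >= end) return false;
        i = begin++;
        return true;
    }
    bool steal_half(long& b, long& e) {       // thief path: grab the upper half
        std::lock_guard<std::mutex> g(m);
        long remaining = end - begin;
        if (remaining < 2) return false;
        b = begin + remaining / 2; e = end; end = b;
        return true;
    }
};

int main() {
    const int workers = 4;
    const long total = 1'000'000;
    std::vector<Range> ranges(workers);
    for (int w = 0; w < workers; ++w) {       // static partition first
        ranges[w].begin = w * total / workers;
        ranges[w].end = (w + 1) * total / workers;
    }
    std::atomic<long> done{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            long i;
            for (;;) {
                while (ranges[w].take_one(i)) done.fetch_add(1);   // local work
                bool stole = false;                                // then steal
                for (int v = 0; v < workers && !stole; ++v) {
                    long b, e;
                    if (v != w && ranges[v].steal_half(b, e)) {
                        std::lock_guard<std::mutex> g(ranges[w].m);
                        ranges[w].begin = b; ranges[w].end = e;
                        stole = true;
                    }
                }
                if (!stole) return;                                // nothing left
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("processed %ld of %ld iterations\n", done.load(), total);
    return 0;
}
```

The point of the paper is precisely that on a GPU such dynamic sharing forces conservative (slow, device-wide) synchronization unless scopes can be promoted on the rare steal path; the sketch above only shows the scheduling side of that argument.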

Synchronization Using Remote-Scope Promotion

Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, David A. Wood
2015 SIGARCH Computer Architecture News  
Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average.  ...  It works poorly for dynamic sharing patterns (e.g., work stealing) where programmers cannot use a faster small scope due to the rare possibility that the work is stolen by a thread in a distant slower  ...  Steal-only: The third algorithm, called steal-only, improves on baseline by replacing its static scheduling algorithm with work stealing.  ... 
doi:10.1145/2786763.2694350 fatcat:uopjp6rwifhr7j7x7u42agdf5y

Hardware/software support for adaptive work-stealing in on-chip multiprocessor

Quentin Meunier, Frédéric Pétrot, Jean-Louis Roch
2010 Journal of systems architecture  
To deal in a portable way with MPSoCs having a different number of processors running possibly at different frequencies, Work Stealing (WS) based parallelization is a current research trend.  ...  The previous evaluations of WS, either theoretical or experimental, were done on fixed multicore architectures.  ...  Considering CMPs, [25] focuses on the number of cache misses by comparing the performance of two schedulers, i.e., traditional work stealing (WS) and Parallel Depth First (PDF).  ... 
doi:10.1016/j.sysarc.2010.06.007 fatcat:2qu6f5nuw5cgflfwautmxflcay

Scalability of Macroblock-level Parallelism for H.264 Decoding

Mauricio Alvarez Mesa, Alex Ramírez, Arnaldo Azevedo, Cor Meenderinck, Ben Juurlink, Mateo Valero
2009 2009 15th International Conference on Parallel and Distributed Systems  
Second, an implementation on a real multiprocessor architecture including a comparison of different scheduling strategies and a profiling analysis for identifying the performance bottlenecks.  ...  First, a formal model for predicting the maximum performance that can be obtained taking into account variable processing time of tasks and thread synchronization overhead.  ...  ACKNOWLEDGMENT This work has been supported by the European Commission in the context of the SARC project (contract no. 27648), and the Spanish Ministry of Education (contract no. TIN2007-60625).  ... 
doi:10.1109/icpads.2009.124 dblp:conf/icpads/MesaRAMJV09 fatcat:zlse7gtwarbuvekn4z46aiopty
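For a sense of what macroblock-level parallelism can offer, here is a back-of-the-envelope wavefront calculation assuming the standard 2D-wave dependencies (a macroblock waits for its left and upper neighbours, including the upper-right one) and unit processing time per macroblock. It deliberately ignores the variable task times and synchronization overhead that the paper's formal model accounts for.

```cpp
// Back-of-the-envelope 2D-wave schedule for MB-level H.264 decoding, assuming
// unit macroblock processing time and left / up / upper-right dependencies.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int W = 120, H = 68;                 // macroblocks in a 1920x1088 frame
    std::vector<std::vector<int>> start(H, std::vector<int>(W, 0));
    int makespan = 0;
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int left  = (x > 0) ? start[y][x - 1] + 1 : 0;
            int upright = (y > 0 && x + 1 < W) ? start[y - 1][x + 1] + 1 : 0;
            int up    = (y > 0) ? start[y - 1][x] + 1 : 0;
            start[y][x] = std::max({left, upright, up});
            makespan = std::max(makespan, start[y][x] + 1);
        }
    }
    // Count how many macroblocks share each time slot: the peak is the maximum
    // number of cores this frame can keep busy under these assumptions.
    std::vector<int> busy(makespan, 0);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            ++busy[start[y][x]];
    int peak = *std::max_element(busy.begin(), busy.end());
    std::printf("time steps: %d, peak parallel macroblocks: %d\n", makespan, peak);
    return 0;
}
```

With these assumptions a 120x68 macroblock frame yields a peak of about 60 macroblocks in flight; the paper's model refines this ceiling with variable processing times and synchronization cost.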

Effectively sharing a cache among threads

Guy E. Blelloch, Phillip B. Gibbons
2004 Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures - SPAA '04  
The parallel schedule we study is a parallel depth-first schedule (pdf-schedule) based on the sequential one. The schedule is greedy and therefore work-efficient.  ...  We model a computation as a dag and the sequential execution as a depth first schedule of the dag.  ...  Notation: W1 = work with a sequential schedule; WI = work with a Sequential Ideal Cache; WL = work with a Sequential LRU Cache; Wp = work with a parallel schedule; dp = depth with a parallel schedule; δ (≥ 0) = number of  ... 
doi:10.1145/1007912.1007948 dblp:conf/spaa/BlellochG04 fatcat:ajtbkagn45dbni3e4otsiyotf4
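The pdf-schedule mentioned in the snippet can be written down directly: number the dag nodes by the sequential depth-first (1DF) execution order, then at each step greedily run up to p ready nodes, preferring the smallest 1DF numbers. A minimal sketch on a hand-made fork-join dag (the dag and p = 2 are chosen only for illustration):

```cpp
// Construct a greedy p-processor pdf-schedule from a dag's 1DF numbering.
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    // A small fork-join dag: 0 forks {1,2}; 1 forks {3,4}; 2 forks {5,6};
    // node 7 joins 3,4,5,6.
    std::vector<std::vector<int>> children = {
        {1, 2}, {3, 4}, {5, 6}, {7}, {7}, {7}, {7}, {}
    };
    const int n = (int)children.size();
    std::vector<int> indeg(n, 0);
    for (const auto& cs : children)
        for (int c : cs) ++indeg[c];

    // 1DF numbering: one-processor depth-first execution (LIFO ready stack,
    // children pushed right-to-left so the leftmost child runs next).
    std::vector<int> rank(n, -1), indeg1 = indeg, stack;
    for (int v = n - 1; v >= 0; --v)
        if (indeg1[v] == 0) stack.push_back(v);
    for (int next = 0; !stack.empty(); ) {
        int v = stack.back(); stack.pop_back();
        rank[v] = next++;
        for (auto it = children[v].rbegin(); it != children[v].rend(); ++it)
            if (--indeg1[*it] == 0) stack.push_back(*it);
    }

    // Greedy p-processor pdf-schedule: each step runs the <= p ready nodes
    // with the smallest 1DF rank.
    const int p = 2;
    auto cmp = [&](int a, int b) { return rank[a] > rank[b]; };
    std::priority_queue<int, std::vector<int>, decltype(cmp)> ready(cmp);
    std::vector<int> indeg2 = indeg;
    for (int v = 0; v < n; ++v)
        if (indeg2[v] == 0) ready.push(v);
    for (int step = 0; !ready.empty(); ++step) {
        std::vector<int> batch;
        for (int k = 0; k < p && !ready.empty(); ++k) {
            batch.push_back(ready.top()); ready.pop();
        }
        std::printf("step %d:", step);
        for (int v : batch) std::printf(" %d", v);
        std::printf("\n");
        for (int v : batch)
            for (int c : children[v])
                if (--indeg2[c] == 0) ready.push(c);
    }
    return 0;
}
```

Because every step runs some ready node whenever one exists, the schedule is greedy, which is what makes it work-efficient as the snippet states.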

Implementation and performance aspects of Kahn process networks

Zeljko Vrba
2010 ACM SIGMultimedia Records  
Lastly, we use Nornir to evaluate several load-balancing methods: static assignment, work-stealing, our improvement of work-stealing, and a method based on graph partitioning.  ...  KPNs are a model of concurrency that relies exclusively on message passing, and that has some advantages over parallel programming tools in wide use today: simplicity, graphical representation, and determinism  ...  Scatter/gather on 8 CPUs with 16 workers: speedup and steal rate vs. work per message.  ... 
doi:10.1145/1874413.1874418 fatcat:sm5wsqpyyfevtarf6336elhyny
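As the snippet notes, a Kahn process network is a set of sequential processes connected by FIFO channels with blocking reads, which is what makes the result independent of how the processes are scheduled. A minimal two-process chain (producer, doubler) with an illustrative unbounded Channel class; this is a generic sketch, not Nornir's API:

```cpp
// Kahn-process-network-style sketch: processes communicate only through FIFO
// channels with blocking reads, so the output is deterministic.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Channel {
public:
    void put(std::optional<T> v) {              // nullopt signals end-of-stream
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    std::optional<T> get() {                    // blocking read
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        auto v = std::move(q_.front()); q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<T>> q_;
};

int main() {
    Channel<int> a, b;
    std::thread producer([&] {                  // process 1: emit 1..5
        for (int i = 1; i <= 5; ++i) a.put(i);
        a.put(std::nullopt);
    });
    std::thread doubler([&] {                   // process 2: x -> 2x
        while (auto v = a.get()) b.put(*v * 2);
        b.put(std::nullopt);
    });
    while (auto v = b.get()) std::printf("%d\n", *v);   // process 3: print
    producer.join();
    doubler.join();
    return 0;
}
```

Because each process can only block on a read and never poll or select among channels, the printed sequence is the same for every interleaving of the threads, which is the determinism property the abstract highlights.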

Task Superscalar: An Out-of-Order Task Pipeline

Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero
2010 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture  
Keywords: Out-of-order execution, CMP/manycore, task superscalar, parallel programming  ...  This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores.  ...  The move to parallel architectures, coupled with the cumbersomeness of existing parallel programming models, raised interest in task-based models.  ... 
doi:10.1109/micro.2010.13 dblp:conf/micro/EtsionCRRBALV10 fatcat:yv3y4z5b2je7tev23rndrmuoeu

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures [article]

Varun Nagpal
2016 arXiv   pre-print
To the best of our knowledge, this algorithm has only been accelerated on a supercomputer-class machine, the Cray XMT, and no work exists that demonstrates performance evaluation and comparison of this  ...  on CMPs.  ...  In heterogeneous CMP architectures, different types of cores are integrated on the same chip.  ... 
arXiv:1603.02655v1 fatcat:nklt3op66vdfdpmd3ygeckhwla

Efficient memory layout for packet classification system on multi-core architecture

Shariful Hasan Shaikot, Min Sik Kim
2012 2012 IEEE Global Communications Conference (GLOBECOM)  
Bakken for serving on my committee.  ...  He taught me how to see the light at the end of the tunnel, showed me how to conduct research, and guided me throughout this work. It is very rewarding to work with Dr. Kim.  ...  First, the tree depth depends on the distribution of the rules in the rule space.  ... 
doi:10.1109/glocom.2012.6503501 dblp:conf/globecom/ShaikotK12 fatcat:xk7ua5ldpbfsxghuuzb63tvtsq

Reconstructing Hardware Transactional Memory for Workload Optimized Systems [chapter]

Kunal Korgaonkar, Prabhat Jain, Deepak Tomar, Kashyap Garimella, Veezhinathan Kamakoti
2011 Lecture Notes in Computer Science  
The two-day technical program of APPT 2011 provided an excellent venue capturing the state of the art and practice in parallel architectures, parallel software and distributed and cloud computing.  ...  This creates grand challenges to architectural and system designs, as well as to methods of programming these systems, which form the core theme of APPT 2011.  ...  To schedule its activities, the X10 runtime provides a work-stealing algorithm that balances the execution of activities across workers.  ... 
doi:10.1007/978-3-642-24151-2_1 fatcat:32cx745cn5cfdm5sbeah6eyiey

Software challenges in extreme scale systems

Vivek Sarkar, William Harrod, Allan E Snavely
2009 Journal of Physics, Conference Series  
probably represents the first true parallel processor to fly in space, and one of the earliest examples of multi-threaded architectures.  ...  His Ph.D. thesis on the parallel solution of recurrence equations was one of the early works on what is now called parallel prefix, and applications of those results are still acknowledged as defining  ...  • Challenge: scheduling with bounded resources (adapt across eager vs. lazy scheduling for starvation vs. contention modes) • Related topics: control vs data-driven initiation/termination of tasks  ... 
doi:10.1088/1742-6596/180/1/012045 fatcat:iukutry2dvbitfdh6ng7kgz564
Showing results 1 — 15 out of 62 results