
Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors [chapter]

Peter Brezany, Alok Choudhary, Minh Dang
1998 Lecture Notes in Computer Science  
on out-of-core data.  ...  A promising approach is to develop language support and a compiler system on top of an advanced runtime system which can automatically transform an appropriate in-core program to efficiently operate  ...  However, the efficient parallelization of irregular applications for distributed-memory multiprocessors (DMMPs) is still a challenging problem.  ... 
doi:10.1007/3-540-49530-4_25 fatcat:s2dxsxjb5jac3oyxlrc6qrzqwy

Automatic generation of application-specific accelerators for FPGAs from python loop nests

David Sheffield, Michael Anderson, Kurt Keutzer
2012 22nd International Conference on Field Programmable Logic and Applications (FPL)  
Design space exploration on the FPGA proceeds by varying the number of PEs in the system. Over four benchmark kernels, our system achieves 3× to 6× speedup relative to soft-core C performance.  ...  Our system applies traditional dependence analysis and reordering transformations to a restricted set of Python loop nests.  ...  To support compilation on GPUs, Copperhead imposes several restrictions required for compilation.  ... 
doi:10.1109/fpl.2012.6339372 dblp:conf/fpl/SheffieldAK12 fatcat:hphpwnv4uvdkxlwhptkr6p7ery
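The "reordering transformations" this entry mentions can be illustrated with a tiny sketch (not code from the paper's Python-to-FPGA flow, which targets Python loop nests): a dependence-free nest where classical dependence analysis licenses interchanging the loops to get unit-stride accesses.

```c
#include <assert.h>

#define N 64
#define M 48

/* Illustrative only: both routines compute the same result, so the
 * interchange is legal; the second traverses memory with unit stride,
 * which is the locality payoff a reordering transformation is after. */
static void scale_original(double a[N][M], double s) {
    for (int j = 0; j < M; j++)      /* j outer: strided accesses */
        for (int i = 0; i < N; i++)
            a[i][j] *= s;
}

static void scale_interchanged(double a[N][M], double s) {
    for (int i = 0; i < N; i++)      /* i outer: unit-stride accesses */
        for (int j = 0; j < M; j++)
            a[i][j] *= s;
}
```

A compiler proves the interchange safe by showing no loop-carried dependence crosses the two loops; here each element is touched exactly once.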

Scheduling Dynamic OpenMP Applications over Multicore Architectures [chapter]

François Broquedis, François Diakhaté, Samuel Thibault, Olivier Aumage, Raymond Namyst, Pierre-André Wacrenier
2008 Lecture Notes in Computer Science  
We achieve a speedup of 14 on a 16-core machine with no application-level optimization.  ...  data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines.  ...  Parallel languages such as OpenMP, that rely on the combination of a dedicated compiler and a set of code annotations to extract the parallel structure of applications and to generate scheduling hints  ... 
doi:10.1007/978-3-540-79561-2_15 fatcat:n5pkgkq7jzhhpostmjt4xt4oje
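The irregular, load-imbalanced workloads this entry targets are the classic case for OpenMP's dynamic scheduling. A minimal sketch (generic OpenMP, not the paper's runtime): per-iteration cost grows with the index, so a static partition leaves some threads idle while dynamic chunking rebalances at run time.

```c
#include <assert.h>

#define NTASKS 1000

/* Iteration i costs O(i) operations -> deliberately imbalanced load. */
static long irregular_work(int i) {
    long acc = 0;
    for (int k = 0; k <= i; k++)
        acc += k;
    return acc;
}

static long run_irregular(void) {
    long total = 0;
    /* Chunks of 16 iterations are handed to idle threads on demand;
     * without OpenMP the pragma is ignored and the loop runs serially,
     * producing the same total. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < NTASKS; i++)
        total += irregular_work(i);
    return total;
}
```

The `reduction(+:total)` clause gives each thread a private accumulator and combines them at the end, avoiding a shared-counter bottleneck.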

Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures [chapter]

Konstantin Berlin, Jun Huan, Mary Jacob, Garima Kochhar, Jan Prins, Bill Pugh, P. Sadayappan, Jaime Spacco, Chau-Wen Tseng
2004 Lecture Notes in Computer Science  
We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC, Global Arrays) on both shared and distributed-memory architectures.  ...  We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes.  ...  Our conclusion is that parallel applications requiring fine-grain accesses achieve poor performance on clusters regardless of the programming paradigm or language feature used, because the amount of inherent  ... 
doi:10.1007/978-3-540-24644-2_13 fatcat:js24djykkfhohk2gmc2m4dmbdu


Seyong Lee, Seung-Jai Min, Rudolf Eigenmann
2008 Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '09  
regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial on a CPU).  ...  This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications.  ...  Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0429535-CCF, CNS-0751153, and 0833115-CCF.  ... 
doi:10.1145/1504176.1504194 dblp:conf/ppopp/LeeME09 fatcat:7ru27sozu5h5hhlni4w4cdx6hi


Seyong Lee, Seung-Jai Min, Rudolf Eigenmann
2009 SIGPLAN notices  
regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial on a CPU).  ...  This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications.  ...  Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0429535-CCF, CNS-0751153, and 0833115-CCF.  ... 
doi:10.1145/1594835.1504194 fatcat:wbpl7ohbzffedndc6s6tafkfny

A Survey on Hardware and Software Support for Thread Level Parallelism [article]

Somnath Mazumdar, Roberto Giorgi
2016 arXiv   pre-print
We also review the programming models with respect to their support for shared memory, distributed memory and heterogeneity.  ...  Today's computers are built upon multiple processing cores and run applications consisting of a large number of threads, making runtime thread management a complex process.  ...  TRIPS supports TLP and DLP on a single-threaded application using its four 16-wide, out-of-order cores.  ... 
arXiv:1603.09274v3 fatcat:75isdvgp5zbhplocook6273sq4

Application Specific Customization and Scalability of Soft Multiprocessors

Deepak Unnikrishnan, Jia Zhao, Russell Tessier
2009 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines  
StreamIt [18] [20] is a high-level, architecture-independent language and compiler targeted at streaming applications.  ...  Each processor requires less on-chip memory to store instructions and data for its application segment. We evaluate the impact of application granularity on on-chip memory later in Chapter 5.  ... 
doi:10.1109/fccm.2009.41 dblp:conf/fccm/UnnikrishnanZT09 fatcat:7cjy7ltl4rcyzlo7e2p4hdecdq

Harnessing Adaptivity Analysis for the Automatic Design of Efficient Embedded and HPC Systems

Silvia Lovergine, Fabrizio Ferrandi
2013 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum  
As a consequence, modern embedded systems exploit the potential of hundreds or thousands of processing units, often heterogeneous and physically distributed, which run in parallel on the many-core  ...  Such a scheduling technique, called dynamic AC-scheduling, provides support for the High-Level Synthesis (HLS) of adaptive hardware cores.  ...  GMT provides a set of features to address issues of irregular applications running on distributed-memory architectures.  ... 
doi:10.1109/ipdpsw.2013.230 dblp:conf/ipps/LovergineF13 fatcat:vpdgybp2gnbmve6wzgscv6hqoa


Rafael K. V. Maeda, Peng Yang, Xiaowen Wu, Zhe Wang, Jiang Xu, Zhehui Wang, Haoran Li, Luan H. K. Duong, Zhifei Wang
2016 Proceedings of the 1st International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems - AISTECS '16  
Unlike most simulators, JADE uses statistical models that follow distributions extracted from internal structures of the application, providing a more convenient and systematic exploration approach  ...  JADE's simulation features include detailed electrical and optical interconnections, a detailed memory-hierarchy infrastructure, and built-in energy analysis, allowing studies of a broad spectrum of systems  ...  Adjustable configurations of electrical and optical Network-on-Chip (NoC), memory hierarchy and coherence protocols are supported. We publicly release JADE, available online at [1].  ... 
doi:10.1145/2857058.2857066 dblp:conf/hipeac/MaedaYWW0WLDW16 fatcat:ogxjh6ztovh2zbsycxt76dnctq

Automatic parallelization of irregular applications

E. Gutiérrez, R. Asenjo, O. Plata, E.L. Zapata
2000 Parallel Computing  
However, there is still a lack of convenient software support for implementing efficient parallel applications.  ...  Both issues are dealt with in depth and in the context of sparse computations (for the first issue) and irregular histogram reductions (for the second issue).  ...  Acknowledgements We gratefully thank David Padua, at the Department of Computer Science, University of Illinois at Urbana-Champaign, for providing us the Polaris compiler, and also Yuan Lin, for the kind  ... 
doi:10.1016/s0167-8191(00)00052-1 fatcat:vdi2bbfgyffu3i4vv62e5zkohm

From Plasma to BeeFarm: Design Experience of an FPGA-Based Multicore Prototype [chapter]

Nehir Sonmez, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh, Mateo Valero
2011 Lecture Notes in Computer Science  
Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure, and on the actual executions of our experiments  ...  running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons and future trends of using hardware-based emulation for research.  ...  This paper reports on our experience of designing and building an eight-core cache-coherent shared-memory multiprocessor system on FPGA, called BeeFarm, to help investigate support for Transactional  ... 
doi:10.1007/978-3-642-19475-7_37 fatcat:eno4vzv2jrdqpjw6ytoqv56cdm

HOMPI: A Hybrid Programming Framework for Expressing and Deploying Task-Based Parallelism [chapter]

Vassilios V. Dimakopoulos, Panagiotis E. Hadjidoukas
2011 Lecture Notes in Computer Science  
This paper presents HOMPI, a framework for programming and executing task-based parallel applications on clusters of multiprocessors and multi-cores, while providing interoperability with existing programming  ...  systems such as MPI and OpenMP. HOMPI facilitates expressing irregular and adaptive master-worker and divide-and-conquer applications, avoiding explicit MPI calls.  ...  This paper presents HOMPI, a directive-based programming and runtime environment for task-parallel applications on clusters of multiprocessor/multi-core nodes.  ... 
doi:10.1007/978-3-642-23397-5_3 fatcat:dsi3dgm32jg5rekwihbi52f3sm

A lock-free cache-friendly software queue buffer for decoupled software pipelining

Wen Ren Chen, Wuu Yang, Wei Chung Hsu
2010 2010 International Computer Symposium (ICS2010)  
However, its success relies on fast inter-core synchronization and communication.  ...  A lock-free, cache-friendly solution needs to take two different aspects of the memory system, memory coherence and memory consistency, into consideration.  ...  ACKNOWLEDGMENT The work reported in this paper is partially supported by the National Science Council, Taiwan, Republic of China, under grants NSC 96-2628-E-009-014-MY3, NSC 98-2220-E-009-050, and NSC 98-2220  ... 
doi:10.1109/compsym.2010.5685364 fatcat:efsa7jj54ne3nlaz2m5o2l2onm
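The kind of structure this entry studies can be sketched with a generic single-producer/single-consumer ring queue in C11 atomics (a minimal illustration, not the paper's buffer): `head` is written only by the consumer and `tail` only by the producer, so no locks are needed, and acquire/release ordering addresses the consistency concerns the abstract raises. Cache-friendly variants additionally keep the two indices on separate cache lines to avoid false sharing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 8                       /* capacity must be a power of two */

typedef struct {
    int buf[QCAP];
    _Alignas(64) atomic_size_t head; /* next slot to read  (consumer-owned) */
    _Alignas(64) atomic_size_t tail; /* next slot to write (producer-owned) */
} spsc_queue;

static bool spsc_push(spsc_queue *q, int v) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)               /* full */
        return false;
    q->buf[t & (QCAP - 1)] = v;
    /* release: slot contents become visible before the new tail */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_queue *q, int *out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)                      /* empty */
        return false;
    *out = q->buf[h & (QCAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

The free-running counters never wrap incorrectly: unsigned subtraction `t - h` gives the occupancy even after overflow, and the power-of-two mask maps counters to slots.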


Michel Barreteau, Juliette Mattioli, Thierry Grandpierre, Christophe Lavarenne, Yves Sorel, Philippe Bonnot, Philippe Kajfasz
2000 Proceedings of the international conference on Compilers, architectures, and synthesis for embedded systems - CASES '00  
PROMPT [1] provides a new approach that relies on the co-operation of two technologies whose main strength lies in simultaneously taking into account the regular and irregular aspects of telecom applications  ...  Growing computation needs and improving processor integration make the mapping of embedded real-time applications more and more expensive.  ...  One of them is optimized to handle the SIMD and regular aspects of SP applications and SoCs, whereas the other takes into account the irregular and MIMD aspects required by the SoC and such applications.  ... 
doi:10.1145/354880.354887 dblp:conf/cases/BarreteauMGLSBK00 fatcat:oujnkk2tercbbnxupqclbtkofi