Design of throughput-optimized arrays from recurrence abstractions

Arpith C. Jacob, Jeremy D. Buhler, Roger D. Chamberlain
2010 ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors  
FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal.  ...  In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device  ...  This observation leads to an efficient search strategy for finding arrays with optimal throughput.  ... 
doi:10.1109/asap.2010.5540753 dblp:conf/asap/JacobBC10 fatcat:blvol54tsrhuxkhg73k2iqazbe

Evaluating support for global address space languages on the Cray X1

Christian Bell, Wei-Yu Chen, Dan Bonachea, Katherine Yelick
2004 Proceedings of the 18th annual international conference on Supercomputing - ICS '04  
an interesting target for GAS languages.  ...  and global pointers for the Berkeley UPC compiler.  ...  We believe that both the pointer and GASNet work will be useful on other architectures with similar memory layout and access characteristics (specifically including the SGI Altix), and that our analysis  ... 
doi:10.1145/1006209.1006236 dblp:conf/ics/BellCBY04 fatcat:zscmxn62hrebhl7x27dt6gp3cy

Controller Synthesis for Mapping Partitioned Programs on Array Architectures [chapter]

Hritam Dutta, Frank Hannig, Jürgen Teich
2006 Lecture Notes in Computer Science  
This paper presents an efficient methodology for the automated control path synthesis for the mapping of partitioned algorithms onto processor arrays.  ...  Processor arrays can be used as accelerators for plenty of dataflow-dominant applications.  ...  The space-time mapping is an important transformation for obtaining full-size processor array descriptions from a given nested loop program.  ... 
doi:10.1007/11682127_13 fatcat:2ri566quzneadmx7p6pwjhwgwa

Dependence Analysis and Parallelizing Transformations [chapter]

Sanjay Rajopadhye
2002 The Compiler Design Handbook  
For an affine allocation function, observe that two points z and z′ in D_X are mapped to the same processor if and only if Φz = Φz′, i.e., (z − z′) belongs to the null space or kernel of Φ.  ...  One might question whether choosing an optimal mapping to these virtual processors will remain optimal after the virtual-to-physical mapping that is expected to follow.  ... 
doi:10.1201/9781420040579.ch10 fatcat:xgduobxz5jhangkq3g2tvafsua

Financial software on GPUs

Cosmin E. Oancea, Christian Andreetta, Jost Berthold, Alain Frisch, Fritz Henglein
2012 Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing - FHPC '12  
We discover a rich optimization space with nontrivial trade-offs and cost models.  ...  This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient  ...  The Haskell code shown here has in fact been written as a prototype for reasoning about potential parallelization strategies for a C+GPU version; while at the same time providing the basis for an optimized  ... 
doi:10.1145/2364474.2364484 dblp:conf/icfp/OanceaABFH12 fatcat:emwjal43irftxlhsstzlbyqga4

Constraint directed CAD tool for automatic latency-optimal implementation of 1-D and 2-D Fourier transforms

J. Gregory Nash, John Schewel, Philip B. James-Roxby, Herman H. Schmit, John T. McHenry
2002 Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications IV  
A specialized CAD tool is described that will take a user's high level code description of a non-uniform affinely indexed algorithm and automatically generate abstract latency-optimal systolic arrays.  ...  The tool is then used to generate new 1-D and 2-D hardware efficient systolic arrays for the discrete Fourier transform that take advantage of the use of the radix-4 matrix decomposition.  ...  SPADE DESCRIPTION SPADE takes as input a coded form of a system of non-uniform affine recurrence equation T and x t represent affine "mappings" from one index set to another index set, where the latter  ... 
doi:10.1117/12.455338 fatcat:jcwzawy755c3nleaxxqfcwbvye

A processor-time-minimal systolic array for cubical mesh algorithms

P. Cappello
1992 IEEE Transactions on Parallel and Distributed Systems  
Systolic realizations are natural for any algorithm that has an iterative dependence graph; such graphs can be realized using only local communication in space and time.  ...  [emphasis added] Such dags can be extracted automatically from, for example, uniform recurrent equations [37], a system of uniform recurrence equations [24], regular iterative arrays [43, 20], a system  ...  The question nonetheless arises as to whether relaxing the linearity constraint results in an even more efficient use of time and space.  ... 
doi:10.1109/71.113078 fatcat:obhiyczctfhqfhmqewvszrzdjq

Modulo graph embedding

Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
2006 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems - CASES '06  
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost and energy efficiency.  ...  CGRAs consist of an array of function units and register files generally organized as a two dimensional grid.  ...  Our work proposes a generic scheduling strategy, and memory sharing and other such optimizations can be integrated into our system as a preprocessing step.  ... 
doi:10.1145/1176760.1176778 dblp:conf/cases/ParkFKM06 fatcat:cejmnre3crg2xfniulpxkxexz4

A novel compiler support for automatic parallelization on multicore systems

José M. Andión, Manuel Arenaz, Gabriel Rodríguez, Juan Touriño
2013 Parallel Computing  
This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors.  ...  It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence).  ...  The second step is the construction of an efficient OpenMP-enabled parallelization strategy for the sequential program as a whole.  ... 
doi:10.1016/j.parco.2013.04.003 fatcat:cbptzynydfdl3pm4rcnktfq5ji

Signal Assignment to Hierarchical Memory Organizations for Embedded Multidimensional Signal Processing Systems

F. Balasa, Hongwei Zhu, I.I. Luican
2009 IEEE Transactions on Very Large Scale Integration (VLSI) Systems  
This paper proposes an efficient algorithm for mapping multidimensional arrays to the data memory.  ...  Finding an optimized storage of the usually large arrays from these algorithmic specifications is an essential task of memory allocation.  ...  Shashidhar-for helpful discussions and for sharing with them some of the benchmark tests used in this paper.  ... 
doi:10.1109/tvlsi.2008.2003514 fatcat:u25xgdrwrjeexk5s3td34gq5d4

Lazy tree splitting

Lars Bergstrom, Mike Rainey, John Reppy, Adam Shaw, Matthew Fluet
2010 Proceedings of the 15th ACM SIGPLAN international conference on Functional programming - ICFP '10  
NDP languages provide language features favoring the NDP style, efficient compilation of NDP programs, and various common NDP operations like parallel maps, filters, and sum-like reductions.  ...  If, on the other hand, there are too few large chunks of work, there will be too much sequential processing and processors will sit idle.  ...  Acknowledgments We would like to thank the anonymous referees for their helpful suggestions and the National Science Foundation for their support under Grants CCF-0811389, CCF-0811419, and CCF-1010568.  ... 
doi:10.1145/1863543.1863558 dblp:conf/icfp/BergstromRRSF10 fatcat:hxzifjutp5cqzfxo7wf3fn2kvm

Lazy tree splitting

Lars Bergstrom, Mike Rainey, John Reppy, Adam Shaw, Matthew Fluet
2010 SIGPLAN notices  
NDP languages provide language features favoring the NDP style, efficient compilation of NDP programs, and various common NDP operations like parallel maps, filters, and sum-like reductions.  ...  If, on the other hand, there are too few large chunks of work, there will be too much sequential processing and processors will sit idle.  ...  Acknowledgments We would like to thank the anonymous referees for their helpful suggestions and the National Science Foundation for their support under Grants CCF-0811389, CCF-0811419, and CCF-1010568.  ... 
doi:10.1145/1932681.1863558 fatcat:rtqe25iyynb4zbggnb4ntbmzo4

Advanced optimization strategies in the Rice dHPF compiler

J. Mellor-Crummey, V. Adve, B. Broom, D. Chavarría-Miranda, R. Fowler, G. Jin, K. Kennedy, Q. Yi
2002 Concurrency and Computation  
This optimization significantly reduces communication frequency. • The compiler further reduces message frequency by aggregating communication events for affine references for disjoint sections of an array  ...  . • The compiler can coalesce messages for arbitrary affine references to a data array.  ...  Principally, programmers write a single-threaded Fortran program and augment it with layout directives to map data elements onto an array of processors.  ... 
doi:10.1002/cpe.647 fatcat:taze6xqwpzhw3hw27yglleqnei

Synthesis, structure and power of systolic computations

Jozef Gruska
1990 Theoretical Computer Science  
This machine consists of a linear array of programmable processors that have been designed in such a way that the whole array can implement efficiently various systolic systems especially for vision and  ...  long term search for improving efficiency.  ...  An emulation 0: h, LUI N1 is called computationally uniform if the same number of processors of N, are mapped into each processor of N,, and also the same number of edges of N, are mapped into each edge  ... 
doi:10.1016/0304-3975(90)90190-s fatcat:sztlpeed4jhcdojkzi5z5zyska

Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many Integrated Core Architecture [article]

Troy A. Porter, Andrey E. Vladimirov
2013 arXiv   pre-print
Our library is highly optimized for general-purpose processors with multiple cores and vector instructions, with hierarchical memory cache structure.  ...  We discuss in detail the optimization steps that we took in order to optimize for the Intel MIC architecture, which also significantly benefited the performance of the code on general-purpose processors  ...  We accordingly improve the code by allocating space for only one array and declaring another array as a pointer referring to the same memory location.  ... 
arXiv:1311.4627v1 fatcat:rdvwotcdcbeizjynby5c77oxaa