38,935 Hits in 3.2 sec

Optimizing aggregate array computations in loops

Yanhong A. Liu, Scott D. Stoller, Ning Li, Tom Rothamel
2005 ACM Transactions on Programming Languages and Systems  
An aggregate array computation is a loop that computes accumulated quantities over array elements.  ...  Such computations are common in programs that use arrays, and the array elements involved in such computations often overlap, especially across iterations of loops, resulting in significant redundancy  ...  of Section 5.4 and which helped us understand the effect of our optimization on cache.  ... 
doi:10.1145/1053468.1053471 fatcat:th7uti4na5dm3iezu45cszlfay
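The redundancy Liu et al. target is visible in the simplest aggregate array computation: a sliding-window sum, where consecutive windows overlap in all but two elements. A minimal Python sketch of the idea (illustrative only, not the paper's general incrementalization algorithm):

```python
def window_sums_naive(a, w):
    # Recompute each window sum from scratch: O(n * w) work.
    return [sum(a[i:i + w]) for i in range(len(a) - w + 1)]

def window_sums_incremental(a, w):
    # Exploit the overlap between consecutive windows: O(n) work.
    if len(a) < w:
        return []
    s = sum(a[:w])
    out = [s]
    for i in range(1, len(a) - w + 1):
        # Add the element entering the window, drop the one leaving it.
        s += a[i + w - 1] - a[i - 1]
        out.append(s)
    return out
```

Both versions return the same values; only the second reuses the previous iteration's aggregate.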

Believe it or not!

Rajesh Bordawekar, Uday Bondhugula, Ravi Rao
2010 Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10  
compiler parallelization and optimization.  ...  In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one.  ...  Acknowledgments We would like to acknowledge Tommy Wong, Salem Derisavi, Tong Chen, and Alexandre Eichenberger for useful comments and help in understanding some performance anomalies.  ... 
doi:10.1145/1854273.1854340 dblp:conf/IEEEpact/BordawekarBR10 fatcat:p74xpqe3pvakndu3jy4bjspnga

Affine Loop Optimization Based on Modulo Unrolling in Chapel

Aroon Sharma, Darren Smith, Joshua Koehler, Rajeev Barua, Michael Ferguson
2014 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14  
This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message aggregation for parallel loops in message passing programs that use affine array accesses in Chapel, a  ...  loop.  ...  The output of the optimization is an equivalent loop structure that aggregates communication from all of the loop body's remote affine array accesses.  ... 
doi:10.1145/2676870.2676877 dblp:conf/pgas/SharmaSKBF14 fatcat:4blvcp2vdjfqzkowr6uerwfhfa
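The aggregation step described in the snippet can be sketched as follows: given an affine access pattern and a block distribution of the array across locales, group the touched indices by owner so each owner's elements move in one bulk message rather than one message per iteration. A toy sketch (the block distribution and the function name are assumptions, not the Chapel compiler's implementation):

```python
def aggregate_remote_accesses(trip_count, array_len, num_locales, affine):
    """Group the array indices touched by an affine access i -> affine(i)
    by owning locale under a block distribution, so all elements owned by
    one locale can be fetched in a single aggregated message."""
    block = (array_len + num_locales - 1) // num_locales  # block size per locale
    by_owner = {}
    for i in range(trip_count):
        idx = affine(i)
        by_owner.setdefault(idx // block, []).append(idx)
    return by_owner
```

For a stride-2 access over 16 elements on 4 locales, the 8 fine-grained accesses collapse into 4 per-locale transfers.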

Source-level global optimizations for fine-grain distributed shared memory systems

R. Veldema, R. F. H. Hofman, R. A. F. Bhoedjang, C. J. H. Jacobs, H. E. Bal
2001 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming - PPoPP '01  
Source-level analysis makes existing access-check optimizations (e.g., access-check batching) more effective and enables two novel fine-grain DSM optimizations: object-graph aggregation and automatic computation  ...  Computation migration (or function shipping) is used to optimize critical sections in which a single processor owns both the shared data that is accessed and the lock that protects the data.  ...  Array Aggregation If array elements are accessed in a loop, the access checks to array elements may sometimes be lifted out of the loop and replaced by an aggregate array slice access check before  ... 
doi:10.1145/379539.379578 dblp:conf/ppopp/VeldemaHBJB01 fatcat:rajtv7cjiranbouw4t5agjefcm
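The array-aggregation transformation the snippet describes, lifting per-element access checks out of a loop into one slice check, can be sketched with a toy DSM interface (the `check`/`fetch` API below is hypothetical, chosen only to make the hoisting visible):

```python
def make_dsm(data):
    """Toy DSM array: fetches are plain reads, and `checks` records
    every access check issued (hypothetical API, for the sketch)."""
    checks = []

    def check(lo, hi):
        checks.append((lo, hi))  # in a real DSM this validates/faults in [lo, hi)

    def fetch(i):
        return data[i]

    return fetch, check, checks

def checked_sum_fine(fetch, check, n):
    # Fine-grained: one access check per element access.
    total = 0
    for i in range(n):
        check(i, i + 1)
        total += fetch(i)
    return total

def checked_sum_aggregated(fetch, check, n):
    # Optimized: a single aggregate slice check hoisted out of the loop.
    check(0, n)
    return sum(fetch(i) for i in range(n))
```

The two versions compute the same sum; the second issues one check instead of n.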

Cache-efficient memory layout of aggregate data structures

Preeti Ranjan Panda, Luc Semeria, Giovanni de Micheli
2001 Proceedings of the 14th international symposium on Systems synthesis - ISSS '01  
We describe an important memory optimization that arises in the presence of aggregate data structures such as arrays and structs in a C/C++ based system design methodology.  ...  Experiments on typical applications from the DSP domain result in up to 44% improvement in memory performance.  ...  given a set of arrays of either simple data types such as integer, or aggregate data types such as structs; and a set of innermost loops in a program accessing different arrays with different array index  ... 
doi:10.1145/500024.500026 fatcat:jnkms3dvgnd3laetp7q6wdz6fu
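The layout question behind this line of work is array-of-structs versus struct-of-arrays: a loop that reads only one field of every record strides past the other fields in an AoS layout, but walks contiguous storage in an SoA layout. A minimal sketch of the rearrangement (Python dicts and lists stand in for C structs and arrays):

```python
def to_soa(records, fields):
    """Rearrange an array-of-structs into a struct-of-arrays: each field
    becomes one contiguous list, so a loop that touches a single field
    reads consecutive storage instead of skipping over the other fields
    of every record -- the access pattern cache lines reward."""
    return {f: [rec[f] for rec in records] for f in fields}
```

The transformation preserves all values; only their grouping in memory changes.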

High-Level Synthesis: Productivity, Performance, and Software Constraints

Yun Liang, Kyle Rupnow, Yinan Li, Dongbo Min, Minh N. Do, Deming Chen
2012 Journal of Electrical and Computer Engineering  
FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements.  ...  In particular, we first evaluate AutoPilot using the popular embedded benchmark kernels.  ...  In this step, we examine the computation loops in the program and apply loop pipelining, loop merging, loop unrolling, loop flattening, and expression balancing to optimize performance.  ... 
doi:10.1155/2012/649057 fatcat:lvu2kniyyvaa7prpklymhslf5m
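Of the loop optimizations the snippet lists, unrolling is the easiest to show in source form: replicating the body cuts loop-control overhead per element and, in an HLS flow, exposes the copies for parallel scheduling. A sketch with an unroll factor of 4 (illustrative; the tools apply this automatically):

```python
def dot_unrolled(a, b):
    """Dot product with the loop body unrolled by a factor of 4.
    A remainder loop handles trip counts that are not a multiple of 4."""
    n = len(a)
    main = n - n % 4
    acc = 0
    for i in range(0, main, 4):
        # Four independent body copies per iteration of the control loop.
        acc += a[i] * b[i]
        acc += a[i + 1] * b[i + 1]
        acc += a[i + 2] * b[i + 2]
        acc += a[i + 3] * b[i + 3]
    for i in range(main, n):  # remainder loop
        acc += a[i] * b[i]
    return acc
```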

A report on the sisal language project

John T. Feo, David C. Cann, Rodney R. Oldehoeft
1990 Journal of Parallel and Distributed Computing  
In this report we discuss the project's objectives, philosophy, and accomplishments and state our future plans.  ...  Four significant results of the Sisal project are compilation techniques for high-performance parallel applicative computation, a microtasking environment that supports dataflow on conventional shared-memory  ...  One consequence of this policy is that users must specify the order in which elements of recursive aggregates are computed.  ... 
doi:10.1016/0743-7315(90)90035-n fatcat:3r2n5dujvffjxhlz2dgxzlwx5a

An automated approach to improve communication-computation overlap in clusters

L. Fishgold, A. Danalis, L. Pollock, M. Swany
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
This paper describes a sourceto-source optimizing transformation that can be performed by an automatic (or semi-automatic) system in order to restructure MPI codes towards maximizing communication-computation  ...  For this approach to be effective the parallel application using the cluster must be structured in a way that enables communication computation overlapping.  ...  Many scientific codes contain frequently executed sections consisting of a multiply-nested loop in which the inner loops execute some computation kernel and store the results in an array which is then  ... 
doi:10.1109/ipdps.2006.1639590 dblp:conf/ipps/FishgoldDPS06 fatcat:c32yae6amffnbhel4crzqwihki
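The restructuring this entry describes pipelines communication behind computation: while the result for chunk i is in flight, the program already computes chunk i+1. A toy stand-in using a background thread in place of a nonblocking MPI send (real codes would use MPI_Isend/MPI_Wait; the function name is an assumption):

```python
import threading

def process_chunks(chunks, compute, send):
    """Overlap pattern: the send of chunk i runs on a background thread
    while the main loop computes chunk i+1; each send is completed
    ('waited on') just before the next one is issued."""
    pending = None
    for chunk in chunks:
        result = compute(chunk)       # compute the current chunk
        if pending is not None:
            pending.join()            # complete the previous send
        pending = threading.Thread(target=send, args=(result,))
        pending.start()               # start sending in the background
    if pending is not None:
        pending.join()                # drain the last send
```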

Efficient iterative processing in the SciDB parallel array engine

Emad Soroush, Magdalena Balazinska, Simon Krughoff, Andrew Connolly
2015 Proceedings of the 27th International Conference on Scientific and Statistical Database Management - SSDBM '15  
In this paper, we develop a model for iterative array computations and a series of optimizations.  ...  Many scientific data-intensive applications perform iterative computations on array data. There exist multiple engines specialized for array processing.  ...  In case an operator in SciDB is guided by Array-Loop to request repartitioning, the SciDB optimizer injects the Scatter/Gather [14] operators to shuffle the data in the input iterative array before the  ... 
doi:10.1145/2791347.2791362 dblp:conf/ssdbm/SoroushBKC15 fatcat:kjfqzmtdhvfjxevv6a4viujd34

Combining Static and Dynamic Data Coalescing in Unified Parallel C

Michail Alvanos, Montse Farreras, Ettore Tiotto, Jose Nelson Amaral, Xavier Martorell
2016 IEEE Transactions on Parallel and Distributed Systems  
Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through  ...  When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation.  ...  In contrast, the solution described in this paper focuses on loops that contain fine-grained communication and achieves much better aggregation and overlapping of communication and computation.  ... 
doi:10.1109/tpds.2015.2405551 fatcat:isr4fuw6nvfpzfo4abngauwame

Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Jae-Seung Yeom, Dimitrios S. Nikolopoulos
2010 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis  
Strider transparently optimizes grouping, decomposition, and scheduling of explicit software-managed accesses to multi-dimensional arrays in nested loops, given a highlevel specification of loops and their  ...  In particular, Strider contributes new methods to improve temporal locality, optimize the critical path of scheduling data transfers for multi-stride accesses in regular nested parallel loops, and distribute  ...  The runtime system performs aggregation by fusing loop levels in the partition of the iteration space assigned to an SPE and re-blocking the fused loops, under the constraint that the aggregated working  ... 
doi:10.1109/sc.2010.52 dblp:conf/sc/YeomN10 fatcat:ay2dpu3dczdkdbx5yaohzf6xty
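One building block of optimizing strided accesses is coalescing: merging the element indices a loop nest touches into maximal contiguous runs, so each run can move in a single bulk (DMA-style) transfer. A small sketch of that grouping step (illustrative, not Strider's scheduler):

```python
def coalesce(indices):
    """Merge element indices into maximal contiguous half-open runs
    [start, stop), each of which can be moved in one bulk transfer."""
    runs = []
    for i in sorted(indices):
        if runs and i == runs[-1][1]:
            runs[-1][1] = i + 1          # extend the current run
        else:
            runs.append([i, i + 1])      # start a new run
    return [tuple(r) for r in runs]
```

Two rows of a row-major 2-D array, for example, coalesce into two transfers instead of eight element accesses.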

Aggregating processor free time for energy reduction

Aviral Shrivastava, Eugene Earlie, Nikil Dutt, Alex Nicolau
2005 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '05  
In this paper, we present code transformations to aggregate processor free time.  ...  However, any such technique has a performance overhead in terms of switching time.  ...  Loop unrolling is a popular optimization that reduces the computation per Iteration (C) of a loop.  ... 
doi:10.1145/1084834.1084876 dblp:conf/codes/ShrivastavaEDN05 fatcat:paswyhfok5bgbagvzh4ezzu6ym

Region array SSA

Silvius Rus, Guobin He, Christophe Alias, Lawrence Rauchwerger
2006 Proceedings of the 15th international conference on Parallel architectures and compilation techniques - PACT '06  
In this paper we propose to improve the applicability of previous efforts in array SSA through the use of a symbolic memory access descriptor that can aggregate the accesses to the elements of an array  ...  scalar optimizations.  ...  It can represent the aggregation of scalar and array memory references at any hierarchical level (on the loop and subprogram call graph) in a program.  ... 
doi:10.1145/1152154.1152165 dblp:conf/IEEEpact/RusHAR06 fatcat:jtrqutzp3ncbxon5xrwdnfppgu
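The symbolic descriptors this entry mentions summarize all accesses an affine reference makes over a loop as one closed-form region. A toy analogue, reducing the index set {a*i + b : lo <= i < hi} to a (start, stop, stride) triple (far simpler than the paper's region descriptors, which handle multidimensional and conditional accesses):

```python
def affine_region(a, b, lo, hi):
    """Aggregate the indices {a*i + b : lo <= i < hi} into a single
    symbolic (start, stop, stride) triple, stop exclusive."""
    if hi <= lo:
        return None                      # empty iteration space
    first = a * lo + b
    last = a * (hi - 1) + b
    if a >= 0:
        return (first, last + 1, a if a else 1)
    return (last, first + 1, -a)         # normalize descending accesses
```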
Showing results 1 — 15 out of 38,935 results