Optimizing aggregate array computations in loops
2005
ACM Transactions on Programming Languages and Systems
An aggregate array computation is a loop that computes accumulated quantities over array elements. ...
Such computations are common in programs that use arrays, and the array elements involved in such computations often overlap, especially across iterations of loops, resulting in significant redundancy ...
of Section 5.4 and which helped us understand the effect of our optimization on cache. ...
doi:10.1145/1053468.1053471
fatcat:th7uti4na5dm3iezu45cszlfay
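The redundancy this abstract describes, where overlapping array elements contribute to accumulated quantities across loop iterations, can be illustrated with a minimal sketch (illustrative function names, not code from the paper): a naive windowed-sum loop does O(n·k) work, while an incremental version that exploits the overlap between consecutive windows does O(n).

```c
#include <assert.h>
#include <stddef.h>

/* Naive aggregate array computation: each window sum is recomputed
   from scratch, O(n*k). */
void window_sums_naive(const int *a, size_t n, size_t k, long *out) {
    for (size_t i = 0; i + k <= n; i++) {
        long s = 0;
        for (size_t j = 0; j < k; j++)
            s += a[i + j];
        out[i] = s;
    }
}

/* Incremental version: consecutive windows share k-1 elements, so
   each new sum reuses the previous one, O(n). */
void window_sums_incremental(const int *a, size_t n, size_t k, long *out) {
    if (n < k) return;
    long s = 0;
    for (size_t j = 0; j < k; j++)
        s += a[j];
    out[0] = s;
    for (size_t i = 1; i + k <= n; i++) {
        s += a[i + k - 1] - a[i - 1];  /* add entering element, drop leaving one */
        out[i] = s;
    }
}
```

Both functions fill `out` with identical sums; only the amount of redundant work differs.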
Believe it or not!
2010
Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10
compiler parallelization and optimization. ...
In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. ...
Acknowledgments We would like to acknowledge Tommy Wong, Salem Derisavi, Tong Chen, and Alexandre Eichenberger for useful comments and help in understanding some performance anomalies. ...
doi:10.1145/1854273.1854340
dblp:conf/IEEEpact/BordawekarBR10
fatcat:p74xpqe3pvakndu3jy4bjspnga
Affine Loop Optimization Based on Modulo Unrolling in Chapel
2014
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14
This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message aggregation for parallel loops in message passing programs that use affine array accesses in Chapel, a ...
loop. ...
The output of the optimization is an equivalent loop structure that aggregates communication from all of the loop body's remote affine array accesses. ...
doi:10.1145/2676870.2676877
dblp:conf/pgas/SharmaSKBF14
fatcat:4blvcp2vdjfqzkowr6uerwfhfa
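A hedged sketch of the kind of message aggregation this abstract describes (the `remote_get` and `remote_get_strided` helpers are hypothetical stand-ins for PGAS communication primitives, not Chapel's API): element-wise remote affine accesses inside the loop become a single bulk strided transfer before it, with the loop body then iterating over the local buffer.

```c
#include <assert.h>
#include <stddef.h>

static int msg_count; /* counts simulated remote transfers */

/* Hypothetical one-element remote read: one message per element. */
static int remote_get(const int *remote, size_t idx) {
    msg_count++;
    return remote[idx];
}

/* Hypothetical bulk strided transfer of n elements in one message. */
static void remote_get_strided(const int *remote, size_t base, size_t stride,
                               size_t n, int *local) {
    msg_count++;
    for (size_t i = 0; i < n; i++)
        local[i] = remote[base + i * stride];
}

/* Before: each affine access A[base + i*stride] is its own message. */
long sum_naive(const int *remote, size_t base, size_t stride, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += remote_get(remote, base + i * stride);
    return s;
}

/* After: one aggregated transfer, then purely local iteration. */
long sum_aggregated(const int *remote, size_t base, size_t stride, size_t n) {
    int buf[256]; /* assumes n <= 256 for this sketch */
    remote_get_strided(remote, base, stride, n, buf);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}
```

The message counter makes the aggregation visible: n messages collapse to one.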
Source-level global optimizations for fine-grain distributed shared memory systems
2001
Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming - PPoPP '01
Source-level analysis makes existing access-check optimizations (e.g., access-check batching) more effective and enables two novel fine-grain DSM optimizations: object-graph aggregation and automatic computation ...
Computation migration (or function shipping) is used to optimize critical sections in which a single processor owns both the shared data that is accessed and the lock that protects the data. ...
Array Aggregation: If array elements are accessed in a loop, the access checks to array elements can sometimes be hoisted out of the loop and replaced by an aggregate array slice access check before ...
doi:10.1145/379539.379578
dblp:conf/ppopp/VeldemaHBJB01
fatcat:rajtv7cjiranbouw4t5agjefcm
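The array-aggregation idea in this abstract can be sketched as follows; `access_check` and `access_check_range` are hypothetical stand-ins for the DSM runtime's checks, with a counter standing in for their cost.

```c
#include <assert.h>
#include <stddef.h>

static int check_count; /* counts simulated access checks */

/* Stand-in for a fine-grain DSM access check on one element. */
static void access_check(const int *p) { (void)p; check_count++; }

/* Stand-in for an aggregate access check on a whole array slice. */
static void access_check_range(const int *p, size_t n) {
    (void)p; (void)n; check_count++;
}

/* Before: one access check per element, inside the loop. */
long sum_checked(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        access_check(&a[i]);
        s += a[i];
    }
    return s;
}

/* After: the per-element checks are hoisted into a single slice
   check before the loop. */
long sum_slice_checked(const int *a, size_t n) {
    access_check_range(a, n);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both variants compute the same sum; the hoisted version pays for one check instead of n.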
Source-level global optimizations for fine-grain distributed shared memory systems
2001
SIGPLAN notices
Source-level analysis makes existing access-check optimizations (e.g., access-check batching) more effective and enables two novel fine-grain DSM optimizations: object-graph aggregation and automatic computation ...
Computation migration (or function shipping) is used to optimize critical sections in which a single processor owns both the shared data that is accessed and the lock that protects the data. ...
Array Aggregation: If array elements are accessed in a loop, the access checks to array elements can sometimes be hoisted out of the loop and replaced by an aggregate array slice access check before ...
doi:10.1145/568014.379578
fatcat:euyfq7feercqnnibmos3nwvswa
Cache-efficient memory layout of aggregate data structures
2001
Proceedings of the 14th international symposium on Systems synthesis - ISSS '01
We describe an important memory optimization that arises in the presence of aggregate data structures such as arrays and structs in a C/C++ based system design methodology. ...
Experiments on typical applications from the DSP domain result in up to 44% improvement in memory performance. ...
Given a set of arrays of either simple data types such as integer, or aggregate data types such as structs; and a set of innermost loops in a program accessing different arrays with different array index ...
doi:10.1145/500024.500026
fatcat:jnkms3dvgnd3laetp7q6wdz6fu
High-Level Synthesis: Productivity, Performance, and Software Constraints
2012
Journal of Electrical and Computer Engineering
FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. ...
In particular, we first evaluate AutoPilot using the popular embedded benchmark kernels. ...
In this step, we examine the computation loops in the program and apply loop pipelining, loop merging, loop unrolling, loop flattening, and expression balancing to optimize performance. ...
doi:10.1155/2012/649057
fatcat:lvu2kniyyvaa7prpklymhslf5m
A report on the sisal language project
1990
Journal of Parallel and Distributed Computing
In this report we discuss the project's objectives, philosophy, and accomplishments and state our future plans. ...
Four significant results of the Sisal project are compilation techniques for high-performance parallel applicative computation, a microtasking environment that supports dataflow on conventional shared-memory ...
One consequence of this policy is that users must specify the order in which elements of recursive aggregates are computed. Consider the array definition X(i,j) ...
doi:10.1016/0743-7315(90)90035-n
fatcat:3r2n5dujvffjxhlz2dgxzlwx5a
An automated approach to improve communication-computation overlap in clusters
2006
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
This paper describes a source-to-source optimizing transformation that can be performed by an automatic (or semi-automatic) system in order to restructure MPI codes towards maximizing communication-computation ...
For this approach to be effective the parallel application using the cluster must be structured in a way that enables communication computation overlapping. ...
Many scientific codes contain frequently executed sections consisting of a multiply-nested loop in which the inner loops execute some computation kernel and store the results in an array which is then ...
doi:10.1109/ipdps.2006.1639590
dblp:conf/ipps/FishgoldDPS06
fatcat:c32yae6amffnbhel4crzqwihki
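A schematic sketch of the restructuring this abstract aims at, using a hypothetical event log as a stand-in for nonblocking communication (in real MPI code these steps would be MPI_Isend and MPI_Wait): the computation of block k+1 is moved between the start and completion of block k's send, so it proceeds while the message is in flight.

```c
#include <assert.h>

/* Event log for the simulated schedule (stand-in for real MPI calls). */
enum { SEND_START = 1, COMPUTE = 2, SEND_WAIT = 3 };
static int trace[64][2];
static int ntrace;
static void log_ev(int ev, int blk) {
    trace[ntrace][0] = ev;
    trace[ntrace][1] = blk;
    ntrace++;
}

/* Unoverlapped: compute block k, then send it and block until done. */
void pipeline_blocking(int nblocks) {
    for (int k = 0; k < nblocks; k++) {
        log_ev(COMPUTE, k);
        log_ev(SEND_START, k);
        log_ev(SEND_WAIT, k);
    }
}

/* Restructured: start a nonblocking send of block k, compute block
   k+1 while it is in flight, then wait for the send to finish. */
void pipeline_overlapped(int nblocks) {
    log_ev(COMPUTE, 0);
    for (int k = 0; k < nblocks; k++) {
        log_ev(SEND_START, k);
        if (k + 1 < nblocks)
            log_ev(COMPUTE, k + 1); /* overlaps with the send of block k */
        log_ev(SEND_WAIT, k);
    }
}
```

The trace makes the schedule checkable: in the overlapped version, COMPUTE of block 1 lands between SEND_START and SEND_WAIT of block 0.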
Efficient iterative processing in the SciDB parallel array engine
2015
Proceedings of the 27th International Conference on Scientific and Statistical Database Management - SSDBM '15
In this paper, we develop a model for iterative array computations and a series of optimizations. ...
Many scientific data-intensive applications perform iterative computations on array data. There exist multiple engines specialized for array processing. ...
In case an operator in SciDB is guided by Array-Loop to request repartitioning, the SciDB optimizer injects the Scatter/Gather [14] operators to shuffle the data in the input iterative array before the ...
doi:10.1145/2791347.2791362
dblp:conf/ssdbm/SoroushBKC15
fatcat:kjfqzmtdhvfjxevv6a4viujd34
Combining Static and Dynamic Data Coalescing in Unified Parallel C
2016
IEEE Transactions on Parallel and Distributed Systems
Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through ...
When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. ...
In contrast, the solution described in this paper focuses on loops that contain fine-grained communication and achieves much better aggregation and overlapping of communication and computation. ...
doi:10.1109/tpds.2015.2405551
fatcat:isr4fuw6nvfpzfo4abngauwame
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories
2010
2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Strider transparently optimizes grouping, decomposition, and scheduling of explicit software-managed accesses to multi-dimensional arrays in nested loops, given a high-level specification of loops and their ...
In particular, Strider contributes new methods to improve temporal locality, optimize the critical path of scheduling data transfers for multi-stride accesses in regular nested parallel loops, and distribute ...
The runtime system performs aggregation by fusing loop levels in the partition of the iteration space assigned to an SPE and re-blocking the fused loops, under the constraint that the aggregated working ...
doi:10.1109/sc.2010.52
dblp:conf/sc/YeomN10
fatcat:ay2dpu3dczdkdbx5yaohzf6xty
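The fuse-and-re-block step described in this abstract can be sketched in miniature (the transfer counter is a hypothetical stand-in for DMA traffic to an SPE's local store; nothing below is Strider's actual code): two loop levels collapse into one linear index that is then processed in blocks sized to a working-set limit, reducing the number of transfers.

```c
#include <assert.h>
#include <stddef.h>

static int transfers; /* counts simulated DMA transfers */

/* Unfused: the inner loop level is transferred row by row,
   one transfer per row of the 2-D array. */
void process_rows(int *a, size_t n1, size_t n2) {
    for (size_t i = 0; i < n1; i++) {
        transfers++; /* fetch row i */
        for (size_t j = 0; j < n2; j++)
            a[i * n2 + j] += 1;
    }
}

/* Fused and re-blocked: both loop levels collapse into one linear
   index, processed in blocks that fit the working-set constraint;
   each block needs only one (larger) transfer. */
void process_fused_blocked(int *a, size_t n1, size_t n2, size_t block) {
    size_t total = n1 * n2;
    for (size_t b = 0; b < total; b += block) {
        size_t end = (b + block < total) ? b + block : total;
        transfers++; /* fetch elements [b, end) in one transfer */
        for (size_t f = b; f < end; f++)
            a[f] += 1;
    }
}
```

Both traversals touch every element exactly once; the fused version simply batches them into fewer, larger transfers.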
Aggregating processor free time for energy reduction
2005
Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '05
In this paper, we present code transformations to aggregate processor free time. ...
However, any such technique has a performance overhead in terms of switching time. ...
Loop unrolling is a popular optimization that reduces the computation per Iteration (C) of a loop. ...
doi:10.1145/1084834.1084876
dblp:conf/codes/ShrivastavaEDN05
fatcat:paswyhfok5bgbagvzh4ezzu6ym
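As a reminder of the baseline transformation this abstract mentions, a minimal generic loop-unrolling sketch (not the paper's code): unrolling by four amortizes the loop-control work over four elements, with a cleanup loop handling the tail.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled reference loop: one add, one compare, one branch per element. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: loop-control overhead is paid once per four
   elements; the second loop cleans up any remainder when n is
   not a multiple of 4. */
long sum_unrolled4(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)
        s += a[i];
    return s;
}
```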
Region array SSA
2006
Proceedings of the 15th international conference on Parallel architectures and compilation techniques - PACT '06
In this paper we propose to improve the applicability of previous efforts in array SSA through the use of a symbolic memory access descriptor that can aggregate the accesses to the elements of an array ...
scalar optimizations. ...
It can represent the aggregation of scalar and array memory references at any hierarchical level (on the loop and subprogram call graph) in a program. ...
doi:10.1145/1152154.1152165
dblp:conf/IEEEpact/RusHAR06
fatcat:jtrqutzp3ncbxon5xrwdnfppgu