Memory access coalescing

Jack W. Davidson, Sanjay Jinturkar
1994 SIGPLAN notices  
...  base register has been modified, then the coalescing may not be safe.  ...  Figure 5: Flow graph showing alignment and alias checks.  ...  can be explained.  ...  Unrolled loop with coalesced memory references.  ... 
doi:10.1145/773473.178259 fatcat:g6m7blvgrbanlezt35bdmueplm

Memory access coalescing

Jack W. Davidson, Sanjay Jinturkar
1994 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation - PLDI '94  
...  base register has been modified, then the coalescing may not be safe.  ...  Figure 5: Flow graph showing alignment and alias checks.  ...  can be explained.  ...  Unrolled loop with coalesced memory references.  ... 
doi:10.1145/178243.178259 dblp:conf/pldi/DavidsonJ94 fatcat:z4xyq4e3y5ftljexn33empnoiu

Optimizing scientific application loops on stream processors

Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, Quan Hoang Nguyen
2008 SIGPLAN notices  
This paper describes a graph coloring compiler framework to allocate on-chip SRF (Stream Register File) storage for optimizing scientific applications on stream processors.  ...  Our framework consists of first applying enabling optimizations such as loop unrolling to expose stream reuse and opportunities for maximizing parallelism, i.e., overlapping kernel execution and memory  ...  Based on Chaitin's original formulation [3], a variety of graph coloring based register allocators have been developed [2, 5, 9].  ... 
doi:10.1145/1379023.1375679 fatcat:omdqm3svlfhppldqfrxkplwtd4

Optimizing scientific application loops on stream processors

Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, Quan Hoang Nguyen
2008 Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems - LCTES '08  
This paper describes a graph coloring compiler framework to allocate on-chip SRF (Stream Register File) storage for optimizing scientific applications on stream processors.  ...  Our framework consists of first applying enabling optimizations such as loop unrolling to expose stream reuse and opportunities for maximizing parallelism, i.e., overlapping kernel execution and memory  ...  Based on Chaitin's original formulation [3], a variety of graph coloring based register allocators have been developed [2, 5, 9].  ... 
doi:10.1145/1375657.1375679 dblp:conf/lctrts/WangYXDYTN08 fatcat:3ifb65yrzzafjobur4chskyyzy

CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Jakob Siegel, Juergen Ributzka, Xiaoming Li
2009 2009 International Conference on Parallel Processing Workshops  
Furthermore, we analyze the performance increase by fully unrolling the innermost loop of the algorithm and propose guidelines on how to best unroll a program for the GPU.  ...  In particular, even that loop unrolling is a common optimization, the performance improvement on a GPU derives from a completely different aspect of this architecture.  ...  Loop unrolling Most of the algorithms that are suited for being implemented in CUDA are heavily loop based.  ... 
doi:10.1109/icppw.2009.78 dblp:conf/icppw/SiegelRL09 fatcat:tpehjt63vzfmzdjaphuecocie4

CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Jakob Siegel, Juergen Ributzka, Xiaoming Li
2011 Journal of Algorithms & Computational Technology  
Furthermore, we analyze the performance increase by fully unrolling the innermost loop of the algorithm and propose guidelines on how to best unroll a program for the GPU.  ...  In particular, even that loop unrolling is a common optimization, the performance improvement on a GPU derives from a completely different aspect of this architecture.  ...  Loop unrolling Most of the algorithms that are suited for being implemented in CUDA are heavily loop based.  ... 
doi:10.1260/1748-3018.5.2.341 fatcat:ho65ohaffvfmdjqkd6n72cinlm

A GPGPU compiler for memory optimization and parallelism management

Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou
2010 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation - PLDI '10  
Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset  ...  (b) The coalesced mv kernel (a) The coalesced mm kernel , the access 'a[idy][i]' is not coalesced, which results in loop unrolling as described above.  ...  The compiler unrolls the loop for 16 times, introduces shared memory variable sA[0:15] which are initialized with A[idy][tidx+i] (coalesced as the increment of 'i' is 16 after unrolling), and replaces  ... 
doi:10.1145/1806596.1806606 dblp:conf/pldi/YangXKZ10 fatcat:d36xrccpdra6de3s2cvganylf4

A GPGPU compiler for memory optimization and parallelism management

Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou
2010 SIGPLAN notices  
Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset  ...  (b) The coalesced mv kernel (a) The coalesced mm kernel , the access 'a[idy][i]' is not coalesced, which results in loop unrolling as described above.  ...  The compiler unrolls the loop for 16 times, introduces shared memory variable sA[0:15] which are initialized with A[idy][tidx+i] (coalesced as the increment of 'i' is 16 after unrolling), and replaces  ... 
doi:10.1145/1809028.1806606 fatcat:olbf6a5zuvcwnnkyqoiw5lyrse

Sponge

Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
2011 Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems - ASPLOS '11  
Loop unrolling is one way to reduce the overhead. This optimization can also increase the register utilization by unrolling loops that use registers.  ...  The degree of unrolling depends on the number of registers the kernel uses and also the number of registers that are available on the GPU.  ... 
doi:10.1145/1950365.1950409 dblp:conf/asplos/HormatiSWMM11 fatcat:dc7hit25y5dpljkppt7opqp3wi

Sponge

Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
2012 SIGPLAN notices  
Loop unrolling is one way to reduce the overhead. This optimization can also increase the register utilization by unrolling loops that use registers.  ...  The degree of unrolling depends on the number of registers the kernel uses and also the number of registers that are available on the GPU.  ... 
doi:10.1145/2248487.1950409 fatcat:tumeor6jljchjebwplu3hwecla

Sponge

Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
2011 SIGPLAN notices  
Loop unrolling is one way to reduce the overhead. This optimization can also increase the register utilization by unrolling loops that use registers.  ...  The degree of unrolling depends on the number of registers the kernel uses and also the number of registers that are available on the GPU.  ... 
doi:10.1145/1961296.1950409 fatcat:7igavklhwjcnpnyg3xeoopcmfa

Sponge

Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
2011 SIGARCH Computer Architecture News  
Loop unrolling is one way to reduce the overhead. This optimization can also increase the register utilization by unrolling loops that use registers.  ...  The degree of unrolling depends on the number of registers the kernel uses and also the number of registers that are available on the GPU.  ... 
doi:10.1145/1961295.1950409 fatcat:4msvwt2sbzcnbmvcdfvpt2k7q4

A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors [chapter]

Kai Zhang, ShuMing Chen, Wei Liu, Xi Ning
2013 Lecture Notes in Computer Science  
By transforming the non-coalesced memory access to coalesced version, the proposed algorithm can achieve the high pipeline parallelism and the high efficient memory access.  ...  The GPUs based system is a popular method.  ...  The external loop unrolling method of the fine-grained pipelined algorithm transforms all non-coalesced memory access to coalesced version, which is the high efficient memory access way.  ... 
doi:10.1007/978-3-642-40820-5_4 fatcat:ysyroxx5abgixozjkpj5s7wyam

Communication-minimizing 2D convolution in GPU registers

Forrest N. Iandola, David Sheffield, Michael J. Anderson, Phitchaya Mangpo Phothilimthana, Kurt Keutzer
2013 2013 IEEE International Conference on Image Processing  
To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions to register, and we do more work per thread with fewer threads.  ...  To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations.  ...  However, more unrolling leads to longer strides in memory accesses which, as discussed in Section 2.2, reduces coalescing and thus reduces usable bandwidth.  ... 
doi:10.1109/icip.2013.6738436 dblp:conf/icip/IandolaSAPK13 fatcat:lnmb2fwaizfkndjkkhtzfgyw3y

A unified optimizing compiler framework for different GPGPU architectures

Yi Yang, Ping Xiang, Jingfei Kong, Mike Mantor, Huiyang Zhou
2012 ACM Transactions on Architecture and Code Optimization (TACO)  
, the access 'a[idy][i]' is not coalesced, which results in loop unrolling as described above. 'b[i][idx]' is coalesced and it transforms to 'b[(i+k)][idx]' due to unrolling for 'a[idy][i]'.  ...  The compiler unrolls the loop for 16 times, introduces shared memory variable sA[0:15] which are initialized with A[idy][tidx+i] (coalesced as the increment of 'i' is 16 after unrolling), and replaces  ... 
doi:10.1145/2207222.2207225 fatcat:yx6p2hyun5cd3bstd76xp7xwom