
A detailed GPU cache model based on reuse distance theory

Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
This model by Tang et al. [21] is also based on reuse distance theory. However, there are a number of reasons why we propose a new cache model.  ...  Our contributions can be summarised as follows: • Five extensions to the reuse distance theory are proposed, creating a detailed cache model for GPUs (section 4).  ...  However, there are still cases where associativity misses account for a significant fraction of the total amount of misses, in particular for the PolyBench/GPU benchmarks. • These benchmarks show no additional  ... 
doi:10.1109/hpca.2014.6835955 dblp:conf/hpca/NugterenBCB14 fatcat:473kmw2jk5ablnyo4trhcc6pd4
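The reuse distance theory that this cache model builds on can be illustrated with a minimal sketch (the function below is a hypothetical illustration, not the paper's model): the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address.

```python
def reuse_distances(trace):
    """Reuse distance per access: distinct addresses touched since the
    previous access to the same address; inf on first use.
    O(N^2) reference implementation; real tools use a balanced tree."""
    stack, dists = [], []          # most recently used address is last
    for addr in trace:
        if addr in stack:
            dists.append(len(stack) - 1 - stack.index(addr))
            stack.remove(addr)
        else:
            dists.append(float("inf"))
        stack.append(addr)
    return dists

print(reuse_distances(["a", "b", "a", "c", "b"]))
# [inf, inf, 1, inf, 2]: "a" is re-used past 1 distinct address ("b"),
# "b" past 2 distinct addresses ("a" and "c")
```

Under a fully associative LRU cache of C lines, an access hits exactly when its reuse distance is below C, which is what makes this metric a cache-model building block.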

On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing [article]

Michael Kenzel, Bernhard Kerbl, Wolfgang Tatzgern, Elena Ivanchenko, Dieter Schmalstieg, Markus Steinberger
2018 arXiv   pre-print
On actual GPU hardware, our evaluation shows that our strategies not only lead to good reuse of processing results, but also boost performance by 2-3× compared to naïvely ignoring reuse in a variety of  ...  Our simulations showcase that our batch-based strategies significantly outperform parallel caches in terms of reuse.  ...  Loop subdivision produces a piecewise linear approximation of smooth surfaces based on B-spline and multivariate spline theory.  ... 
arXiv:1805.08893v1 fatcat:7yyo2xdmmvblxphyqu7llkrsgq
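The vertex-reuse problem the paper addresses can be sketched with a classic FIFO post-transform cache simulation over an index stream (a hypothetical baseline, not the paper's batch-based strategy): each index missing from the cache costs one vertex-shader invocation.

```python
def transforms_needed(indices, cache_size=16):
    """Count vertex-shader invocations with a FIFO post-transform
    cache of `cache_size` entries over an index stream."""
    fifo, transforms = [], 0
    for v in indices:
        if v not in fifo:              # miss: transform and cache it
            transforms += 1
            fifo.append(v)
            if len(fifo) > cache_size:
                fifo.pop(0)            # evict oldest entry
    return transforms

# Two triangles sharing an edge: 6 indices, only 4 unique transforms.
print(transforms_needed([0, 1, 2, 2, 1, 3]))   # 4
```

Comparing such a parallel-unfriendly sequential cache against batch-based grouping is exactly the kind of trade-off the paper's simulations evaluate.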

Analytical Modeling the Multi-Core Shared Cache Behavior with Considerations of Data-Sharing and Coherence

Ming Ling, Xiaoqian Lu, Guangmin Wang, Jiancong Ge
2021 IEEE Access  
[16] proposed a footprint theory based on the concept of the memory footprint.  ...  [27] verified that reuse distance theory can be applied to GPU threads by modeling caches in detail, closely approaching real-hardware behavior. Kiani et al.  ... 
doi:10.1109/access.2021.3053350 fatcat:4nt7ucqlpveotl7hsscjhimnue

An FPGA-based Hardware Accelerator for Real-Time Block-Matching and 3D Filtering

Dong Wang, Jia Xu, Ke Xu
2020 IEEE Access  
A deeply pipelined OpenCL kernel architecture together with a linebuffer-based on-chip data caching scheme were developed to maximize data reuse and reduce external memory bandwidth.  ...  ., the on-board DDR memory) and caches frequently reused data in on-chip memory to further reduce the pressure on global memory bandwidth.  ... 
doi:10.1109/access.2020.3006773 fatcat:ynou7fjaerbxvlveroa6if6lti

Hardware Accelerated Skin Deformation for Animated Crowds [chapter]

Golam Ashraf, Junyu Zhou
2006 Lecture Notes in Computer Science  
This paper explores skeletal deformation calculations on the GPU for crowds of articulated figures. It compares a few strategies for efficient reuse of such calculations on clones.  ...  The system has been implemented for modern PCs with Graphics Accelerator cards that support GPU Shader Model 3.0, and come with accelerated bi-directional PCI express bus communication.  ...  Luebke D. et al [9] implement a level of detail (LOD) representation based on camera distance.  ... 
doi:10.1007/978-3-540-69429-8_23 fatcat:jv4xpmfi7nhs5j223spziqwqua

Locality-Aware CTA Clustering for Modern GPUs

Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, Henk Corporaal
2017 ACM SIGOPS Operating Systems Review  
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward.  ...  We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.  ...  Based on the characterization of a wide spectrum of GPU applications (see Table 2 for details), we classify the sources of GPU inter-CTA locality into the following five categories.  ... 
doi:10.1145/3093315.3037709 fatcat:h7vhnovsqndmxduewwjw5fpy6e

PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling [article]

Atanu Barai and Yehia Arafa and Abdel-Hameed Badawy and Gopinath Chennupati and Nandakishore Santhi and Stephan Eidenbenz
2021 arXiv   pre-print
PPT-Multicore builds upon our previous work towards a multicore cache model.  ...  We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections.  ...  Some of the experiments in this paper were run on the donated machines. This work is partially supported by Triad National Security, LLC subcontract #581326.  ... 
arXiv:2104.05102v1 fatcat:mtcyzxf5g5hslogtp5mra5z7wa
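The way a reuse profile feeds a cache-hit prediction can be shown with a minimal sketch (a simplified fully associative LRU assumption, not PPT-Multicore's actual model): given a histogram of reuse distances, an access is predicted to hit when its distance fits in the cache.

```python
from collections import Counter

def predicted_hit_rate(dists, cache_lines):
    """Predict the hit rate of a fully associative LRU cache from a
    list of reuse distances: an access hits iff its distance is
    finite and smaller than the cache capacity in lines."""
    hist = Counter(dists)
    hits = sum(n for d, n in hist.items()
               if d != float("inf") and d < cache_lines)
    return hits / sum(hist.values())

print(predicted_hit_rate([float("inf"), 1, 2, 5, float("inf")],
                         cache_lines=4))       # 0.4 (2 hits out of 5)
```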

OSCA: An Online-Model Based Cache Allocation Scheme in Cloud Block Storage Systems

Yu Zhang, Ping Huang, Ke Zhou, Hua Wang, Jianying Hu, Yongguang Ji, Bin Cheng
2020 USENIX Annual Technical Conference  
Third, it searches for a near optimal configuration using a dynamic programming method and performs cache reassignment based on the solution.  ...  Our model uses a low overhead method to obtain data reuse distances from the ratio of re-access traffic to the total traffic within a time window.  ...  In this paper, we propose a cache allocation scheme named OSCA based on a novel cache model leveraging re-access ratio.  ... 
dblp:conf/usenix/Zhang00WHJC20 fatcat:onixsmmdzjbipjkiv557jghexq
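The re-access ratio that OSCA's model is built on can be computed cheaply within a time window; a minimal sketch (function name hypothetical, and a simplification of the paper's traffic-based formulation):

```python
def reaccess_ratio(window):
    """Fraction of accesses in a time window that touch a block
    already seen earlier in the same window (re-access traffic
    divided by total traffic)."""
    seen, reaccess = set(), 0
    for blk in window:
        if blk in seen:
            reaccess += 1
        else:
            seen.add(blk)
    return reaccess / len(window)

print(reaccess_ratio([1, 2, 1, 3, 2, 1]))   # 0.5 (3 re-accesses of 6)
```

Tracking only a set membership per window keeps the overhead low compared to maintaining full reuse-distance stacks, which matches the paper's stated goal.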

Machine Learning Enabled Scalable Performance Prediction of Scientific Codes [article]

Gopinath Chennupati and Nandakishore Santhi and Phill Romero and Stephan Eidenbenz
2020 arXiv   pre-print
PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates into a higher-order discrete-event simulation model of  ...  distance distribution models for each basic block, (iii) runs detailed basic-block level simulations to determine hardware pipeline usage.  ...  The mixture model for finding the hit-rates at a given reuse distance is shown in Eq. 6, derived from the stack distance based cache model (SDCM) [11] to estimate cache hit-rates.  ... 
arXiv:2010.04212v2 fatcat:53bor5hw5zgxpp5feymwg5imsy

A Graph-based Model for GPU Caching Problems [article]

Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song
2016 arXiv   pre-print
Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures.  ...  Better GPU caching efficiency can be achieved through careful task scheduling among different threads.  ...  In this paper, we also used a graph-based model to tackle the shared cache problem for irregular GPU applications.  ... 
arXiv:1605.02043v1 fatcat:6fefayucc5dathpxsya7j7kayi

GPU-Assisted High Quality Particle Rendering

Deukhyun Cha, Sungjin Son, Insung Ihm
2009 Computer graphics forum (Print)  
Then, the volume data is visualized efficiently based on the volume photon mapping method where our GPU techniques further improve the rendering quality offered by previous implementations while performing  ...  Visualizing dynamic participating media in particle form by fully solving equations from the light transport theory is a computationally very expensive process.  ...  In order to accelerate it, we used a rectangle drawing-based method that provides both fast and adaptive conversion on the GPU.  ... 
doi:10.1111/j.1467-8659.2009.01502.x fatcat:6kaychggcvhhriiw3yeszrfelu

High-level hardware feature extraction for GPU performance prediction of stencils

Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, Sergei Gorlatch, Christophe Dubach
2020 Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit  
Performance models based on statistical techniques have been proposed to speedup the optimization space exploration.  ...  This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models.  ...  GPU cache models [27] have been built by extending reuse distance theory with parallel execution, memory latency, limited associativity, miss-status holding-registers and warp divergence.  ... 
doi:10.1145/3366428.3380769 dblp:conf/ppopp/RemmelgHLSGD20 fatcat:yf7rtt43evgetmtdz5sy3nmju4
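The "unique cache lines accessed per warp" feature highlighted by the paper is simple to state concretely; a minimal sketch (hypothetical helper, assuming a 128-byte L1 line as on many NVIDIA GPUs):

```python
def unique_cache_lines(addresses, line_bytes=128):
    """Number of distinct cache lines touched by the 32 addresses of
    one warp memory instruction; 1 means fully coalesced 4-byte
    loads, 32 means every lane misses into its own line."""
    return len({addr // line_bytes for addr in addresses})

coalesced = [4 * lane for lane in range(32)]     # stride-1 float loads
strided   = [128 * lane for lane in range(32)]   # one line per lane
print(unique_cache_lines(coalesced))             # 1
print(unique_cache_lines(strided))               # 32
```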