Characterization and analysis of dynamic parallelism in unstructured GPU applications

Jin Wang, Sudhakar Yalamanchili
2014 2014 IEEE International Symposium on Workload Characterization (IISWC)  
In this study, we seek to characterize such dynamically formed parallelism and evaluate implementations designed to exploit them using CUDA Dynamic Parallelism (CDP) - an execution model where parallel  ...  However, emerging data intensive applications are increasingly unstructured - irregular in their memory and control flow behavior over massive data sets.  ...  RELATED WORK Characterization and analysis of GPU applications can be traced to a very early time. Kerr et al.  ... 
doi:10.1109/iiswc.2014.6983039 dblp:conf/iiswc/WangY14 fatcat:vvp5mw62uvh6zjt3zrpjmjkb2q

Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

Haicheng Wu, Gregory Diamos, Jin Wang, Si Li, Sudhakar Yalamanchili
2012 The international journal of high performance computing applications  
This paper identifies important classes of program control flows in applications targeted to commodity commercially available graphics processing units (GPUs) and characterizes their presence in real workloads  ...  An unstructured-to-structured control flow transformation for CUDA kernels is implemented and its performance impact on a large class of GPU applications is assessed.  ...  As to the area of GPU application characterization, Kerr et al. (2009) and Goswami et al. (2010) respectively characterized a large number of GPU benchmarks by using a wide range of metrics covering  ... 
doi:10.1177/1094342011434814 fatcat:txbcpc4sqfdcxpunswnm6nxnxu

OpenMP to CUDA graphs

Chenle Yu, Sara Royuela, Eduardo Quiñones
2020 Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems  
In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous  ...  Due to the variety of accelerators, e.g., FPGAs, GPUs, the use of high-level parallel programming models is desirable to exploit the performance capabilities of them, while maintaining an adequate productivity  ...  Manually generating the CUDA graph of a parallel application as the Cholesky decomposition is a very tedious and error-prone process due to the unstructured parallel nature of the application.  ... 
doi:10.1145/3378678.3391881 dblp:conf/scopes/YuRQ20 fatcat:ryye746lwvg2bhol3w7qef5ugy

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, Wu-Chun Feng
2016 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)  
In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms.  ...  There is a lack of support for explicit synchronization in GPUs between the streaming multiprocessors (SMs) adversely impacts the performance of the GPUs to efficiently perform inter-block communication  ...  [9] lead a study for characterizing the dynamically formed parallelism on irregular (i.e. unstructured) applications on GPUs.  ... 
doi:10.1109/mascots.2016.58 dblp:conf/mascots/HarbF16 fatcat:s5q4kd2arvcpziwovbr67k4rvi

A Similarity Measure for GPU Kernel Subgraph Matching [article]

Robert Lim, Boyana Norris, Allen Malony
2019 arXiv   pre-print
Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel.  ...  The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitates end users, autotuners  ...  Execution environment The graphic processor units used in our experiments are listed in Applications Rodinia and SHOC application suite are a class of GPU applications that cover a wide range of computational  ... 
arXiv:1707.02423v3 fatcat:agbtuzu43nfmnefxtik5pm2dna

Analyzing Analytics

Rajesh Bordawekar, Bob Blainey, Ruchir Puri
2015 Synthesis Lectures on Computer Architecture  
We then use this information to characterize and recommend suitable parallelization strategies for these algorithms, specifically when used in data management workloads.  ...  In this survey paper, we identify some of the key techniques employed in analytics both to serve as an introduction for the non-specialist and to explore the opportunity for greater optimizations for parallelization  ...  on GPUs, and application-specific parallelism using Field Programmable Gate Arrays (FPGAs).  ... 
doi:10.2200/s00678ed1v01y201511cac035 fatcat:jkjywe5rzzaupjwq5rjyavqxi4

Performance Analysis and Optimization of the OP2 Framework on Many-Core Architectures

M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, P. H. J. Kelly
2011 Computer journal  
Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution of an industrial-sized unstructured mesh consisting of about  ...  This paper presents a benchmarking, performance analysis and optimization study of the OP2 'active' library, which provides an abstraction framework for the parallel execution of unstructured mesh applications  ...  (iii) Our analysis demonstrates the performance issues that distinguish the use of CPU and GPU architectures to execute the Airfoil application.  ... 
doi:10.1093/comjnl/bxr062 fatcat:jwy76b33wjdvtff67i5gl3rppa

Are GPUs Non-Green Computing Devices?

Martín Pi Puig, Laura De Giusti, Marcelo Naiouf
2018 Journal of Computer Science and Technology  
This paper analyzes a set of applications from the Rodinia benchmark suite in terms of CPU and GPU performance and energy consumption.  ...  Specifically, it compares single-threaded and multi-threaded CPU versions with GPU implementations, and characterize the execution time, true instant power and average energy consumption to test the idea  ...  Therefore, the number of devices with GPUs and the amount of GPU accelerated applications increased more and more over the past years.  ... 
doi:10.24215/16666038.18.e17 fatcat:62hjnklnm5auphus7vdmlheuea

Empirical characterization of power efficiency for large scale data processing

Yongbin Lee, Sungchan Kim
2015 2015 17th International Conference on Advanced Communication Technology (ICACT)  
To this end, this paper aims at characterizing the power efficiency of CPUs and GPUs for big data processing through empirical measurements.  ...  We take three recent computing units, high-end CPU, and GPU, and mobile embedded GPU as target platforms.  ...  Acknowledgements This work was supported by ICT R&D program of MSIP/IITP (B0101-15-0661, the research and development of the self-adaptive software framework for various IoT devices).  ... 
doi:10.1109/icact.2015.7224902 fatcat:z3eoikyqtrdedecmppsc7siopu

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels [article]

Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li
2020 arXiv   pre-print
Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs.  ...  On average, our analysis reveals that spatial and temporal concurrency within kernel execution in throughput architectures saves energy consumption by 32%, 26% and 33% in GTX470, Tesla M2050 and Tesla  ...  Microarchitecture and use scope variation ensure robust workload analysis. Instead of simulation-based approach, we use real GPU based profiling data to guarantee a widely applicable result.  ... 
arXiv:2011.02368v2 fatcat:xgce6gvcjjcilfwem452yd3hsi

CFD Problems Solving Parallel Approaches on Supercomputers

Tatiana Kudryashova, Sergey Polyakov, Viktoriia Podryga, N. Mastorakis, V. Mladenov, A. Bulucea
2016 MATEC Web of Conferences  
The paper summarizes our experience in solving various practical problems of gas dynamics.  ...  The work is devoted to developing and testing parallel algorithms, suite of computer programs for numerical solution of CFD-problems on modern supercomputers.  ...  Therefore, the applications must explicitly copy the data in memory VPU or GPU and back. This communication is slow and inefficient.  ... 
doi:10.1051/matecconf/20167604025 fatcat:tx5jrcwnmndrzhjf6tj53bnezi

Characterizing Power and Energy Efficiency of Legion Data-Centric Runtime and Applications on Heterogeneous High-Performance Computing Systems [chapter]

Song Huang, Song Fu, Scott Pakin, Michael Lang
2018 High Performance Parallel Computing [Working Title]  
In this chapter, we aim to characterize the power and energy consumption of running HPC applications on Legion.  ...  We run benchmark applications on compute nodes equipped with both CPU and GPU, and measure the execution time, power consumption and CPU/GPU utilization.  ...  Department of Energy contract DE-AC52-06NA25396. This chapter has been assigned the LANL identifier LA-UR-16-25965.  ... 
doi:10.5772/intechopen.81124 fatcat:u6ht6prfc5hvnkouektlxq5mpi


2016 Procedia Computer Science  
and Nodes' States Dynamics as an Early-Warning Signal of Critical Transition in a Banking Network Model V.Y.  ...  for CFD Calculations of Unstructured Grids J.  ... 
doi:10.1016/s1877-0509(16)31051-1 fatcat:pewr5t3hq5fqjike3f6wr2k6pu

GPU based Suffix Array Pattern Matching Approach for Big Data

Vinay Katoch, Sanjay Silakari, Uday Chourasia
2017 International Journal of Computer Applications  
To solve this problem Hadoop has evolved as a most widely used tool and adopted by various popular MNCs like Facebook and Yahoo. To search large number of pattern in big data is a challenging task.  ...  In this work OpenCL is combined with Apache Hadoop to write fast Map/Reduce for pattern matching in data using suffix arrays.  ...  Map-Reduce Map-Reduce applications can precede multiple terabytes of data in parallel on large clusters in a fault-tolerant manner and reliable.  ... 
doi:10.5120/ijca2017914668 fatcat:5vvinxi5qzazdazlt4d4pbdice

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

G.R. Mudalige, M.B. Giles, J. Thiyagalingam, I.Z. Reguly, C. Bertolli, P.H.J. Kelly, A.E. Trefethen
2013 Parallel Computing  
OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications.  ...  These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs.  ...  of benchmarking systems used in this paper.  ... 
doi:10.1016/j.parco.2013.09.004 fatcat:64bue7g3djch5h7c5jci3uvatq