
Phase correlation processing for DPIV measurements

Adric C. Eckstein, John Charonko, Pavlos Vlachos
2008 Experiments in Fluids  
The PIV application is mapped to an Nvidia GPU system, resulting in a 3x speedup over a dual quad-core Intel processor implementation.  ...  The design methodology used to implement the PIV application on a specialized FPGA platform under development is described in brief, and the resulting performance benefit is analyzed.  ...  Srinidhi Varadarajan for providing access to System G. The Nvidia Tesla C1060 GPU was funded through the Nvidia Professor Partnership program.  ... 
doi:10.1007/s00348-008-0492-6 fatcat:hb2ktoovsveslkytejyxmi3uhm
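As a hedged aside on the technique this entry names: phase correlation recovers a displacement from the phase of the normalized cross-power spectrum. A minimal 1D Python sketch of that core idea (not the paper's DPIV implementation; the signal and shift below are made up for illustration):

```python
import numpy as np

def phase_correlation_shift(sig, ref):
    # Cross-power spectrum: F(sig) * conj(F(ref)), normalized to unit
    # magnitude so only the phase (i.e., the shift) survives.
    S = np.fft.fft(sig)
    R = np.fft.fft(ref)
    cps = S * np.conj(R)
    cps /= np.abs(cps) + 1e-12
    # The inverse FFT of a pure phase ramp is a delta at the shift.
    corr = np.fft.ifft(cps).real
    return int(np.argmax(corr))

# Recover a known circular shift between two synthetic signals.
rng = np.random.default_rng(0)
ref = rng.standard_normal(64)
sig = np.roll(ref, 5)                     # ref delayed by 5 samples
print(phase_correlation_shift(sig, ref))  # 5
```

DPIV codes apply the same idea to 2D interrogation windows, where the normalization step suppresses broadband image noise that plain cross-correlation would amplify.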

Tree structured analysis on GPU power study

Jianmin Chen, Bin Li, Ying Zhang, Lu Peng, Jih-kwon Peir
2011 2011 IEEE 29th International Conference on Computer Design (ICCD)  
We measure the power consumption of a wide range of CUDA kernels on an experimental system with GTX 280 GPU to collect statistical samples for power analysis.  ...  In this paper, we present a high-level GPU power consumption model using sophisticated tree-based random forest methods which correlate and predict the power consumption using a set of performance variables  ...  These applications are parallelized using CUDA and run on an experimental system with GTX 280.  ... 
doi:10.1109/iccd.2011.6081376 dblp:conf/iccd/ChenLZPP11 fatcat:cnmgboj5svcnpmjegqf5jtymp4
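The modeling idea above, predicting power from performance counters, can be sketched with a plain linear least-squares fit standing in for the paper's random-forest model. All counter names and values below are illustrative assumptions, not measurements from the paper:

```python
import numpy as np

# Hypothetical per-kernel performance counters (columns) and measured
# power in watts; a real study would use many more samples and counters.
counters = np.array([
    # sm_util  mem_txns  occupancy
    [0.9, 120.0, 0.75],
    [0.4,  40.0, 0.50],
    [0.7, 200.0, 0.60],
    [0.2,  10.0, 0.25],
])
power = np.array([180.0, 110.0, 165.0, 90.0])

# Add a bias column and solve min ||X w - power|| for the weights.
X = np.hstack([counters, np.ones((len(counters), 1))])
w, *_ = np.linalg.lstsq(X, power, rcond=None)
print(np.round(X @ w, 1))  # fitted power per kernel
```

The paper's random forest replaces the single linear fit with an ensemble of decision trees, which captures nonlinear counter-power interactions a linear model misses.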

A characterization and analysis of PTX kernels

Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili
2009 2009 IEEE International Symposium on Workload Characterization (IISWC)  
This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs.  ...  We report on an analysis of over 50 kernels and applications including the full NVIDIA CUDA SDK and UIUC's Parboil Benchmark Suite covering control flow, data flow, parallelism, and memory behavior.  ...  We also thank David Kaeli, Hyesoon Kim, and Nagesh Lakshminarayana for their insightful comments on this paper.  ... 
doi:10.1109/iiswc.2009.5306801 dblp:conf/iiswc/KerrDY09 fatcat:mz3sbt3drrdnlbo46nqm5gtwmi

A Comprehensive Performance Comparison of CUDA and OpenCL

Jianbin Fang, Ana Lucia Varbanescu, Henk Sips
2011 2011 International Conference on Parallel Processing  
This paper presents a comprehensive performance comparison between CUDA and OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world ones.  ...  We make an extensive analysis of the performance gaps taking into account programming models, optimization strategies, architectural details, and underlying compilers.  ...  ACKNOWLEDGMENT We would like to thank the authors from the SHOC benchmark suite, the CUDA SDK and the Rodinia benchmark suite for their valuable benchmarks.  ... 
doi:10.1109/icpp.2011.45 dblp:conf/icpp/FangVS11 fatcat:ema4sxt2sjh37mlbljsddpy4gm

A CUDA-enabled Parallel Implementation of Collaborative Filtering

Zhongya Wang, Ying Liu, Pengshan Ma
2014 Procedia Computer Science  
Collaborative filtering (CF) is one of the essential algorithms in recommendation systems. Based on the performance analysis, two computational kernels are identified.  ...  In order to accelerate CF on large-scale data, a CUDA-enabled parallel CF approach is proposed, along with an efficient data partition scheme.  ...  This work is partially supported by NVIDIA, as we were awarded CUDA Teaching Center status in 2011 and CUDA Research Center status in 2013.  ... 
doi:10.1016/j.procs.2014.05.382 fatcat:q7a5bmm53re4tbokw46zql3wxu
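For orientation, the two classic CF kernels, similarity computation and weighted prediction, look roughly like this. This is a generic user-based sketch on a toy matrix, under the assumption that these match the paper's kernels; its exact formulation and partition scheme may differ:

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); all values are illustrative.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
    [1.0, 0.0, 4.0, 4.0],
])

# Kernel 1: user-user cosine similarity matrix (dense matrix product,
# the natural fit for a GPU).
norms = np.linalg.norm(R, axis=1, keepdims=True)
S = (R @ R.T) / (norms @ norms.T)

# Kernel 2: predict user u's rating of item i as a similarity-weighted
# average over the users who actually rated item i.
u, i = 1, 2
rated = R[:, i] > 0
pred = np.sum(S[u, rated] * R[rated, i]) / np.sum(S[u, rated])
print(round(pred, 2))
```

Both kernels are dominated by dense linear algebra over the rating matrix, which is why they parallelize well once the data is partitioned to fit GPU memory.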

Analyzing Machine Learning Workloads Using a Detailed GPU Simulator [article]

Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor Aamodt
2019 arXiv   pre-print
Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch.  ...  Using GPGPU-Sim's AerialVision performance analysis tool we observe that cuDNN API calls contain many varying phases and appear to include potentially inefficient microarchitecture behaviour such as DRAM  ...  Additional CUDA Language Support NVIDIA's CUDA enables overlapping memory copies from CPU to GPU with computation on the GPU via a construct known as streams (similar to a command queue in OpenCL).  ... 
arXiv:1811.08933v2 fatcat:25pubfhlqndxzkwxniylse45wm

Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures

Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, Sudhakar Yalamanchili
2012 2012 IEEE International Symposium on Performance Analysis of Systems & Software  
Lynx is embedded into the broader GPU Ocelot system, which provides run-time code generation of CUDA programs for heterogeneous architectures.  ...  GPU (GPGPU) based systems, and (3) useful performance metrics described via Lynx's instrumentation language that provide insights into the design of effective instrumentation routines for GPGPU systems  ...  Although dynamic instrumentation has been proven to be a useful program analysis technique for traditional architectures [1] , it has not been fully exploited for GPU-based systems.  ... 
doi:10.1109/ispass.2012.6189206 dblp:conf/ispass/FarooquiKESY12 fatcat:pcfzrlhmorhq7c2i4ppjzfyuy4

Motion Detection in Low Resolution Grayscale Videos Using Fast Normalized Cross Correlation on GP-GPU

Durgaprasad Gangodkar, Gurbinder Singh Gill, Sachin Gupta, Padam Kumar, Ankush Mittal
2011 International Journal of Electronics Signals and Systems  
In this paper we propose a real-time implementation of the full-search FCC algorithm applied to grayscale videos using NVIDIA's Compute Unified Device Architecture (CUDA).  ...  We show that by efficient utilization of the global, shared, and texture memories available on CUDA, we can obtain a speedup on the order of 10x compared to the sequential implementation of FCC.  ...  NVIDIA's Compute Unified Device Architecture NVIDIA's CUDA [23], a general-purpose computing architecture on a GPU, provides avenues for active research to tackle compute-intensive tasks.  ... 
doi:10.47893/ijess.2011.1021 fatcat:t7y4ia6bjbdubcgsmjzvzqkday
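The full-search normalized cross-correlation this entry accelerates can be sketched sequentially in Python; this is the kind of CPU baseline the paper speeds up on CUDA, and the frame and template here are synthetic:

```python
import numpy as np

def ncc(patch, template):
    # Zero-mean normalized cross-correlation of two equal-size blocks.
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom else 0.0

def best_match(frame, template):
    # Exhaustive (full) search: slide the template over the frame and
    # keep the position with the highest NCC score.
    th, tw = template.shape
    best, pos = -2.0, (0, 0)
    for y in range(frame.shape[0] - th + 1):
        for x in range(frame.shape[1] - tw + 1):
            score = ncc(frame[y:y+th, x:x+tw], template)
            if score > best:
                best, pos = score, (y, x)
    return pos, best

rng = np.random.default_rng(1)
frame = rng.random((32, 32))
template = frame[10:18, 5:13].copy()   # template cut from a known spot
print(best_match(frame, template)[0])  # (10, 5)
```

Every candidate position is independent, which is what makes the search embarrassingly parallel and a natural fit for one CUDA thread (or block) per position.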

GPU clusters for high-performance computing

Volodymyr V. Kindratenko, Jeremy J. Enos, Guochun Shi, Michael T. Showerman, Galen W. Arnold, John E. Stone, James C. Phillips, Wen-mei Hwu
2009 2009 IEEE International Conference on Cluster Computing and Workshops  
Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges.  ...  In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments.  ...  CUDA C Currently, NVIDIA's CUDA toolkit is the most widely used GPU programming toolkit available.  ... 
doi:10.1109/clustr.2009.5289128 dblp:conf/cluster/KindratenkoESSASPH09 fatcat:c7hmiyq725bethpk7zhnqdzsla

Modeling Deep Learning Accelerator Enabled GPUs [article]

Md Aamir Raihan, Negar Goli, Tor Aamodt
2019 arXiv   pre-print
When implemented in a GPU simulator, GPGPU-Sim, our tensor core model achieves 99.6% correlation versus an NVIDIA Titan V GPU in terms of average instructions per cycle when running tensor core enabled GEMM  ...  The efficacy of deep learning has resulted in its use in a growing number of applications.  ...  ACKNOWLEDGMENT We thank Francois Demoullin, Deval Shah, Dave Evans, Bharadwaj Machiraju, Yash Ukidave and the anonymous reviewers for their valuable comments on this work.  ... 
arXiv:1811.08309v2 fatcat:sjdjievr55hfjc7vqd6ilo4vjm

Low-cost, high-speed computer vision using NVIDIA's CUDA architecture

Seung In Park, Sean P. Ponce, Jing Huang, Yong Cao, Francis Quek
2008 2008 37th IEEE Applied Imagery Pattern Recognition Workshop  
By utilizing NVIDIA's new GPU programming framework, "Compute Unified Device Architecture" (CUDA), as a computational resource, we realize significant acceleration in image processing algorithm computations  ...  In this paper, we introduce real-time image processing techniques using modern programmable Graphics Processing Units (GPUs).  ...  As an example, the architecture of the GeForce 8 Series, the eighth generation of NVIDIA's graphics cards, based on CUDA is shown in Figure 2.  ... 
doi:10.1109/aipr.2008.4906458 dblp:conf/aipr/ParkPHCQ08 fatcat:rxfoqkw63be7bhovwqaegyccem

Design space explorations for streaming accelerators using Streaming Architectural Simulator

Muhammad Shafiq, M. Pericas, N. Navarro, E. Ayguade
2013 Proceedings of 2013 10th International Bhurban Conference on Applied Sciences & Technology (IBCAST)  
Our design space explorations for different architectural aspects of a GPU-like device are with reference to a baseline established for NVIDIA's Fermi architecture (GPU Tesla C2050).  ...  (ii) We use our simulation tool-chain for the design space explorations of GPU-like streaming architectures.  ...  On the GPU side, we use the nvcc compiler with CUDA compilation tool release 4.0, V0.2.1221. We compiled the CUDA code using optimization level 3.  ... 
doi:10.1109/ibcast.2013.6512151 fatcat:s7shhkgqlnb4baucblqnxqrvyi

Modeling GPU-CPU workloads and systems

Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili
2010 Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units - GPGPU '10  
This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure which can execute and instrument the same CUDA applications  ...  Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous  ...  Principal Component Analysis Principal Component Analysis (PCA) is predicated on the assumption that several variables used in an analysis are correlated, and therefore measure the same property of an  ... 
doi:10.1145/1735688.1735696 dblp:conf/asplos/KerrDY10 fatcat:wnmw2zpc5vf6dhgo6slczg7ila
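The PCA step described in the snippet above can be illustrated with a small numpy sketch. The metric matrix is synthetic, with one deliberately redundant column, to show how PCA concentrates correlated metrics into a single component:

```python
import numpy as np

# Toy metric matrix: rows = applications, columns = recorded metrics.
# Column 1 is (noisy) twice column 0, so one principal component should
# capture nearly all the variance: the redundancy PCA is meant to expose.
rng = np.random.default_rng(0)
base = rng.standard_normal(50)
X = np.column_stack([base, 2 * base + 0.01 * rng.standard_normal(50)])

Xc = X - X.mean(axis=0)            # center each metric
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)    # variance fraction per component
print(np.round(explained, 4))      # first component dominates
```

With 37 metrics per application, as in the paper, the same computation reduces the metric space to the few uncorrelated components that actually distinguish program behaviors.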

Scalable Parallel Motion Estimation on Multi-GPU System

Dong Chen, Hua You Su, Wen Mei, Li Xuan Wang, Chun Yuan Zhang
2013 Applied Mechanics and Materials  
With NVIDIA's parallel computing architecture CUDA, using GPUs to speed up compute-intensive applications has become a research focus in recent years.  ...  Based on the analysis of data dependency and the multi-GPU architecture, a parallel computing model and a communication model are designed.  ...  Xinbiao Gan proposed a parallel full-search motion estimation algorithm using CUDA on a single-GPU system in 2010 [4].  ... 
doi:10.4028/ fatcat:37ph2y3kuvcldh2a7ajanzpv2q

Warp-aware trace scheduling for GPUs

James A. Jablin, Thomas B. Jablin, Onur Mutlu, Maurice Herlihy
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per  ...  GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP).  ...  weaken dependence on NVIDIA's GPU toolchain to explicitly lengthen GPU traces).  ... 
doi:10.1145/2628071.2628101 dblp:conf/IEEEpact/JablinJMH14 fatcat:crjqndrorjhiddj5c2hmvmlbom
Showing results 1–15 of 958