
Intra-node Memory Safe GPU Co-Scheduling [article]

Carlos Reano, Federico Silla, Dimitrios S. Nikolopoulos, Blesson Varghese
2017 arXiv   pre-print
The research reported in this paper is motivated by the need to improve the utilisation of GPUs by proposing a framework, which we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be  ...  GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU.  ...  In this paper, we propose an intra-node, memory-safe GPU co-scheduling framework, referred to as schedGPU.  ... 
arXiv:1712.04495v1 fatcat:mlth4j53wbcpdbyhfjdjiufbxe
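
The snippet states the invariant schedGPU enforces: a job may only proceed if its memory footprint fits in what is currently free on the GPU. Below is a minimal sketch of that admission check using the CUDA runtime's cudaMemGetInfo; the function name and the fail-fast policy are illustrative assumptions, not schedGPU's actual client/server API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Minimal sketch of the admission check behind memory-safe
// co-scheduling: before a job runs, verify its footprint fits in the
// GPU memory that is currently free. schedGPU's actual protocol
// (queuing requests until memory is released) is not shown.
bool reserve_gpu_memory(size_t bytes_needed) {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return false;
    if (bytes_needed > free_b) {
        fprintf(stderr, "need %zu B, only %zu B free\n", bytes_needed, free_b);
        return false;  // a real co-scheduler would block here instead
    }
    return true;
}

int main() {
    if (reserve_gpu_memory(512u << 20)) {   // ask for 512 MiB
        void* buf = nullptr;
        cudaMalloc(&buf, 512u << 20);
        // ... launch kernels, knowing the allocation cannot oversubscribe ...
        cudaFree(buf);
    }
    return 0;
}
```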

Intra-Node Memory Safe GPU Co-Scheduling

Carlos Reano, Federico Silla, Dimitrios S. Nikolopoulos, Blesson Varghese
2018 IEEE Transactions on Parallel and Distributed Systems  
In this paper, we propose an intra-node, memory-safe GPU co-scheduling framework, referred to as schedGPU.  ...  This is because there is no safe handling of the GPU memory requirements of co-scheduled jobs.  ...  His research addresses high-performance on-chip and off-chip interconnection networks, distributed memory systems, and remote GPU virtualisation mechanisms.  ... 
doi:10.1109/tpds.2017.2784428 fatcat:mapkdpljafaofk4blqucy4igny

MaxPair

Yuan Wen, Michael F.P. O'Boyle, Christian Fensch
2018 Proceedings of the 11th Workshop on General Purpose GPUs - GPGPU-11  
Scheduling schemes significantly impact the resulting performance by selecting which kernels run together on the same GPU.  ...  In this paper, we propose a graph-based algorithm to schedule co-run kernels in pairs to optimise system performance.  ...  The fairness of concurrent GPU applications can be improved by augmenting the memory scheduling scheme [10, 24], by providing a virtual memory system [4], or by making the GPU device preemptible [17, 21,  ... 
doi:10.1145/3180270.3180272 dblp:conf/ppopp/WenOF18 fatcat:dl35j6yebfhrbkfsjoviekevc4
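
Co-running a kernel pair means both execute concurrently on one GPU; in CUDA the standard mechanism for this is launching them on independent streams. The sketch below shows only that mechanism, assuming the pairing decision (the paper's graph-based matching) has already been made offline; the kernel bodies are placeholders.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

// Co-run one selected kernel pair on the same GPU via two streams;
// kernels on different streams may overlap when resources allow.
void corun_pair(float* d_x, float* d_y, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    int threads = 256, blocks = (n + threads - 1) / threads;
    kernelA<<<blocks, threads, 0, s1>>>(d_x, n);
    kernelB<<<blocks, threads, 0, s2>>>(d_y, n);
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```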

Effective GPU Sharing Under Compiler Guidance [article]

Chao Chen, Chris Porter, Santosh Pande
2021 arXiv   pre-print
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors  ...  based on the task's resource requirements and devices' load in a memory-safe, resource-aware manner.  ...  For many of these systems, each computing node is equipped with multiple GPU devices.  ... 
arXiv:2107.08538v1 fatcat:54qe3b4stbdmnmq5j2ikxw2rta
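
The entry describes dispatching tasks to devices "based on the task's resource requirements and devices' load in a memory-safe, resource-aware manner". A plausible minimal version of such a placement policy is sketched below; the best-fit heuristic and the function name are assumptions, not the paper's runtime.

```cuda
#include <cuda_runtime.h>

// Illustrative placement policy for a multi-GPU node: choose the GPU
// with the most free memory that can still hold the task's footprint.
int pick_device_for_task(size_t task_bytes) {
    int count = 0, best = -1;
    size_t best_free = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        if (free_b >= task_bytes && free_b > best_free) {
            best = d;
            best_free = free_b;
        }
    }
    return best;  // -1: no GPU can host the task safely right now
}
```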

Preliminary experiences with the Uintah framework on Intel Xeon Phi and Stampede

Qingyu Meng, Alan Humphrey, John Schmidt, Martin Berzins
2013 Proceedings of the Conference on Extreme Science and Engineering Discovery Environment Gateway to Discovery - XSEDE '13  
for asynchronous, out-of-order scheduling of both CPU and GPU computational tasks.  ...  accelerators and co-processors, deep memory hierarchies, as well as managing multiple levels of parallelism.  ...  Uintah uses Pthreads for intra-node task scheduling. Each core directly pulls tasks from multi-stage ready-task queues without any intra-node communication taking place.  ... 
doi:10.1145/2484762.2484779 dblp:conf/xsede/MengHSB13 fatcat:ts6bj33qxzh55glx6jfiwaght4
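
The snippet outlines a pull model: each core takes tasks directly from shared ready queues, with no intra-node messaging. A host-side sketch of that pattern with Pthreads follows; the single queue and integer task IDs are simplified stand-ins for Uintah's multi-stage queues.

```cuda
#include <pthread.h>
#include <queue>

// Worker threads take tasks straight off a shared ready queue, so no
// intra-node message passing is needed.
struct ReadyQueue {
    std::queue<int> tasks;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    bool pop(int* task) {
        pthread_mutex_lock(&lock);
        bool ok = !tasks.empty();
        if (ok) { *task = tasks.front(); tasks.pop(); }
        pthread_mutex_unlock(&lock);
        return ok;
    }
};

// One worker per core, started with pthread_create; each loops until
// the queue drains.
void* worker(void* arg) {
    ReadyQueue* q = static_cast<ReadyQueue*>(arg);
    for (int task; q->pop(&task); ) {
        // execute_task(task);  // hypothetical: run a CPU or GPU task
    }
    return nullptr;
}
```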

Quantifying Data Locality in Dynamic Parallelism in GPUs

Xulong Tang, Ashutosh Pattnaik, Onur Kayiran, Adwait Jog, Mahmut Taylan Kandemir, Chita Das
2018 Proceedings of the ACM on Measurement and Analysis of Computing Systems  
Dynamic parallelism (DP) is a new feature of emerging GPUs that allows new kernels to be generated and scheduled from the device side (GPU) without host-side (CPU) intervention to increase parallelism  ...  There have been considerable efforts focusing on exploiting data locality in GPUs.  ...  To further improve data locality in GPUs, our scheduling strategies can co-exist with other locality-aware cache optimizations [17, 27, 41].  ... 
doi:10.1145/3287318 fatcat:zmop6pak6jefve6jtypricmo2a
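
Dynamic parallelism is a concrete CUDA feature, so it can be shown directly: a parent kernel launches a child grid from the device. The example below is a minimal illustration (not taken from the paper) and must be built with relocatable device code on a compute capability 3.5+ GPU.

```cuda
#include <cstdio>

// Device-side kernel launch: the child grid is generated and
// scheduled on the GPU with no CPU intervention.
// Build with: nvcc -arch=sm_35 -rdc=true dp.cu -lcudadevrt
__global__ void child(int parent_block) {
    printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
}

__global__ void parent() {
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);   // launched from the device
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();   // waits for parents and their children
    return 0;
}
```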

Runtime Adaptation for Autonomic Heterogeneous Computing

Thomas R.W. Scogland, Wu-Chun Feng
2014 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing  
Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers.  ...  We also discuss our current work towards the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly-scalable and portable concurrent queue  ...  Table 6.1: Memory bandwidth in MB/s from each memory node to each CPU/GPU node  ... 
doi:10.1109/ccgrid.2014.23 dblp:conf/ccgrid/ScoglandF14 fatcat:eat26fkykvebnfeqpuuwhhbfbu

Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline

Chunyang Gou, Georgi N. Gaydadjiev
2012 International Journal of Parallel Programming  
One of the major problems with GPU on-chip shared memory is bank conflicts.  ...  Simulation results show that our proposed Elastic Pipeline together with the co-designed bank-conflict-aware warp scheduling reduces pipeline stalls by up to 64.0% (42.3% on average) and improves  ...  Accordingly, the safe memory warp schedule distance in Eq.  ... 
doi:10.1007/s10766-012-0201-1 fatcat:bgri4uzhhrax7a6b6tpcwq4fly
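
The paper's Elastic Pipeline removes bank-conflict stalls in hardware; for contrast, the standard software-side mitigation is padding the shared-memory array. The canonical padded transpose below illustrates the conflict the paper targets, assuming 32x32 thread blocks and 32 banks.

```cuda
// With 32 banks, column accesses into tile[32][32] map a whole warp
// onto one bank (a 32-way conflict); the extra column in tile[32][33]
// shifts each row into a different bank. Launch with 32x32 blocks.
__global__ void transpose_padded(const float* in, float* out, int n) {
    __shared__ float tile[32][33];   // 33, not 32: the padding column
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;   // transposed block origin
    y = blockIdx.x * 32 + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free
}
```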

Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version) [article]

Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson
2017 arXiv   pre-print
But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice.  ...  Current approaches avoid this issue by exploiting scheduling quirks of today's GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks).  ...  Global memory is shared among all device threads. Each workgroup has a portion of local memory for fast intra-workgroup communication.  ... 
arXiv:1707.01989v1 fatcat:6crskrc4uvfuzmiwg646womeke
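
The paper proposes cooperative kernels for OpenCL. CUDA's cooperative launch is a related, though not identical, mechanism: it guarantees co-residency of all blocks, which is what makes the grid-wide barriers that blocking algorithms rely on safe despite otherwise unfair scheduling. A hedged sketch:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A blocking two-phase kernel: the grid-wide barrier is only safe
// because a cooperative launch guarantees every block is resident.
__global__ void blocking_step(int* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = (int)grid.thread_rank();
    if (i < n) data[i] += 1;    // phase 1
    grid.sync();                // grid-wide barrier across all blocks
    if (i < n) data[i] *= 2;    // phase 2 observes all of phase 1
}

// Host side: must use cudaLaunchCooperativeKernel rather than <<<>>>;
// the grid must fit within the device's co-residency limit.
void launch(int* d_data, int n, int blocks, int threads) {
    void* args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void*)blocking_step, dim3(blocks),
                                dim3(threads), args, 0, 0);
    cudaDeviceSynchronize();
}
```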

Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS [article]

Szilárd Páll, Artem Zhmurov, Paul Bauer, Mark Abraham, Magnus Lundborg, Alan Gray, Berk Hess, Erik Lindahl
2020 arXiv   pre-print
Combined with new direct GPU-GPU communication as well as GPU integration, this enables excellent performance from single-GPU simulations through strong scaling across multiple GPUs and efficient multi-node  ...  between CPUs and GPUs.  ...  Hierarchical memory as well as intra- and inter-node interconnects facilitate handling data close to compute units.  ... 
arXiv:2006.09167v2 fatcat:b6jiwmemtvbn3cz3mjfphbfeiu
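
One building block of intra-node direct GPU-GPU communication is a peer-to-peer copy between devices. The sketch below shows only that primitive; GROMACS' actual halo-exchange paths layer domain-specific logic (and CUDA-aware MPI across nodes) on top of transfers of this kind.

```cuda
#include <cuda_runtime.h>

// Direct GPU-GPU transfer between two devices in one node.
void copy_between_gpus(float* dst_on_gpu1, const float* src_on_gpu0,
                       size_t bytes) {
    int can_p2p = 0;
    cudaDeviceCanAccessPeer(&can_p2p, 1, 0);
    if (can_p2p) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);  // device 1 maps device 0
    }
    // Goes over NVLink/PCIe when P2P is enabled; otherwise the runtime
    // stages the copy through host memory.
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
}
```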

Cooperative kernels: GPU multitasking for blocking algorithms

Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson
2017 Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2017  
OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice.  ...  There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g.  ...  Each workgroup has a portion of local memory for fast intra-workgroup communication. Every thread has a portion of very fast private memory for function-local variables.  ... 
doi:10.1145/3106237.3106265 dblp:conf/sigsoft/SorensenED17 fatcat:2vyp6qhwkvbunliy3ntykar57a

Modern Gyrokinetic Particle-In-Cell Simulation of Fusion Plasmas on Top Supercomputers [article]

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker
2015 arXiv   pre-print
GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared-memory partitioning and vectorization, have enabled pushing the  ...  This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC  ...  Co-authors from the Lawrence Berkeley National Laboratory (LBNL) were supported by the DOE-SC funds from contract number DE-AC02-05CH11231.  ... 
arXiv:1510.05546v1 fatcat:ebey3b5dyjhkhfa4moy7dpmx7e

Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

Bei Wang, Stephane Ethier, William Tang, Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker
2017 The International Journal of High Performance Computing Applications  
GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared-memory partitioning and vectorization, have enabled pushing the  ...  The multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, and intra-node shared-memory partitioning, as well as vectorization within each core, have allowed  ...  Co-authors from the Lawrence Berkeley National Laboratory (LBNL) were supported by the DOE-SC funds from contract number DE-AC02-05CH11231.  ... 
doi:10.1177/1094342017712059 fatcat:76inm7ykqfbu7jdi2egzjtdjnm

Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers [article]

Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe
2018 arXiv   pre-print
and GPUs.  ...  As a result, DSL compilers can be made considerably less complex with no loss of performance while immediately targeting multiple hardware or hardware combinations such as distributed nodes with both CPUs  ...  The separation ensures that the scheduling phase can safely assume no data-layout transformations are required, greatly simplifying scheduling transformations.  ... 
arXiv:1803.00419v3 fatcat:6ev6fcbhw5bttdnx6c2iqm7hge

GPU-Aware Non-contiguous Data Movement In Open MPI

Wei Wu, George Bosilca, Rolf vandeVaart, Sylvain Jeaugey, Jack Dongarra
2016 Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16  
In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes.  ...  By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared- and distributed-memory machines.  ...  For intra-node communication, CUDA IPC allows the GPU memory of one process to be exposed to others, and therefore provides a one-sided copy mechanism similar to RDMA.  ... 
doi:10.1145/2907294.2907317 dblp:conf/hpdc/WuBVJD16 fatcat:zxajcyjjtbe2riwgei7nhzdkgy
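
The quoted passage names the mechanism precisely: CUDA IPC exposes one process's device buffer to another, enabling one-sided copies. A minimal two-process sketch follows; transporting the opaque handle between the processes (e.g. over a socket) is omitted.

```cuda
#include <cuda_runtime.h>

// Process A, the buffer's owner, exports an opaque handle; process B
// opens it and pulls the data without A's CPU participating.

// Process A:
cudaIpcMemHandle_t export_buffer(void* d_buf) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    return handle;   // send these bytes to process B out of band
}

// Process B:
void import_and_copy(cudaIpcMemHandle_t handle, void* d_local, size_t bytes) {
    void* d_remote = nullptr;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    // One-sided: B copies A's device data directly.
    cudaMemcpy(d_local, d_remote, bytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}
```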