185 Hits in 4.4 sec

MGSim + MGMark: A Framework for Multi-GPU System Research [article]

Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Rafael Ubal, Xiang Gong, Shane Treadway, Yuhui Bao, Vincent Zhao, José L. Abellán, John Kim, Ajay Joshi, David Kaeli
2018 arXiv   pre-print
(D-MGPU) that both utilize unified memory space and cross-GPU memory access.  ...  The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of GPUs (Graphics Processing Units).  ...  At the same time, there are new challenges associated with multi-GPU systems, including how to handle GPU-to-GPU communication, memory management across the unified CPU and multi-GPU memory space, and  ... 
arXiv:1811.02884v3 fatcat:uqzjyera75dnnnpfeduess7qtq

DIDO: Dynamic Pipelines for In-Memory Key-Value Stores on Coupled CPU-GPU Architectures

Kai Zhang, Jiayu Hu, Bingsheng He, Bei Hua
2017 2017 IEEE 33rd International Conference on Data Engineering (ICDE)  
Our experiments have shown the effectiveness of DIDO in significantly enhancing the system throughput for diverse workloads.  ...  This special property opens up new opportunities for building in-memory key-value store systems, as it eliminates the data transfer costs on the PCI-e bus and enables fine-grained cooperation between the CPU  ...  ACKNOWLEDGEMENT This work is partially funded by a MoE AcRF Tier 1 grant (T1 251RES1610), a startup grant of NUS in Singapore, and NSFC Project 61628204 in China.  ... 
doi:10.1109/icde.2017.120 dblp:conf/icde/ZhangHHH17 fatcat:txh5rqnhgnc53hfcbtgkoifkty
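The staged key-value pipeline DIDO adapts can be sketched abstractly. The following is a minimal, device-free illustration of a two-stage request pipeline (parse, then execute); DIDO's contribution, dynamically re-assigning such stages between the CPU and the coupled GPU, is only modeled in spirit here, and all names are illustrative:

```python
# Toy two-stage key-value request pipeline (parse -> execute).
# DIDO dynamically decides which stages run on CPU vs. GPU; this
# sketch keeps everything on the CPU and shows only the stage split.
def parse(req):
    """Stage 1: split a request string into a command tuple."""
    op, _, rest = req.partition(" ")
    return (op, *rest.split(" ", 1))

def execute(store, cmd):
    """Stage 2: apply the command against the in-memory store."""
    if cmd[0] == "SET":
        store[cmd[1]] = cmd[2]
        return "OK"
    return store.get(cmd[1], "NOT_FOUND")

store = {}
replies = [execute(store, parse(r)) for r in ["SET k v", "GET k", "GET x"]]
```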

C-SAW: A Framework for Graph Sampling and Random Walk on GPUs [article]

Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S. Li, Hang Liu
2020 arXiv   pre-print
Third, towards supporting graphs that exceed the GPU memory capacity, we introduce efficient data transfer optimizations for out-of-memory and multi-GPU sampling, such as workload-aware scheduling and  ...  GPU unified memory and partition-centric processing are viable methods for out-of-memory graph processing. Since graph sampling is irregular, unified memory is not a suitable option [79], [80].  ... 
arXiv:2009.09103v1 fatcat:m2j6vv7twnhjpimnme42jsf5aa
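The core primitive C-SAW accelerates, sampling random walks over a graph, can be sketched in a few lines. This is a minimal, CPU-only illustration of an unbiased random walk, not C-SAW's GPU implementation; the graph and function names are made up for the example:

```python
import random

def random_walk(adj, start, length, seed=None):
    """Sample an unbiased random walk of up to `length` steps from `start`.

    `adj` maps each vertex to its neighbor list; the walk stops early
    at a vertex with no outgoing edges.
    """
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length):
        neighbors = adj.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
adj = {0: [1, 2], 1: [2], 2: [0]}
walk = random_walk(adj, start=0, length=4, seed=42)
```

C-SAW's insight is that many such walks can proceed in parallel on a GPU, with out-of-memory graphs handled by workload-aware transfer scheduling rather than by unified memory.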

Java with Auto-parallelization on Graphics Coprocessing Architecture

Guodong Han, Chenggang Zhang, King Tin Lam, Cho-Li Wang
2013 2013 42nd International Conference on Parallel Processing  
workloads efficiently across the CPU-GPU border.  ...  GPU-based many-core accelerators have gained a footing in supercomputing.  ...  We adopt Java as the target language for unifying CPU and GPU programming in view of its popularity.  ... 
doi:10.1109/icpp.2013.62 dblp:conf/icpp/HanZLW13 fatcat:rw6ndtwjszddppqtkh5ipmks2i

Benchmarking Graph Data Management and Processing Systems: A Survey [article]

Miyuru Dayarathna, Toyotaro Suzumura
2021 arXiv   pre-print
We conduct an in-depth study of the existing literature on benchmarks for graph data management and processing, covering 20 different benchmarks developed during the last 15 years.  ...  This systematic approach allows us to identify multiple issues existing in this area, including i) few benchmarks exist which can produce high workload scenarios, ii) no significant work done on benchmarking  ...  GB memory per GPU), Nvidia GeForce GTX 480 GPU (onboard 1.5 GB memory) 2017 Liu et al  ... 
arXiv:2005.12873v4 fatcat:jh3367b4vjaqbgyvaccjnxqjfi

HeTM: Transactional Memory for Heterogeneous Systems [article]

Daniel Castro, Paolo Romano, Aleksandar Ilic, Amin M. Khan
2019 arXiv   pre-print
HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions.  ...  the TM implementation (e.g., in hardware or software) that best fits the applications' workload and the architectural characteristics of the processing unit.  ...  There are a number of ongoing efforts in academia and industry aimed at automating data management and unifying memory in hybrid accelerated systems, for example, at the compiler level (CGCM [29],  ... 
arXiv:1905.00661v2 fatcat:nxiihazahrc3xnyptkqo35ke3e

RLlib: Abstractions for Distributed Reinforcement Learning [article]

Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica
2018 arXiv   pre-print
We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute  ...  Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation.  ...  Acknowledgements In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported in part by DHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial, Arm  ... 
arXiv:1712.09381v4 fatcat:ihhwdewi4bfndags5x5c65mfaa

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing [article]

Tianfeng Liu
2021 arXiv   pre-print
The main bottlenecks are the processes of preparing data for GPUs: subgraph sampling and feature retrieval.  ...  Nonetheless, existing systems are inefficient at training large graphs with billions of nodes and edges on GPUs.  ...  GNNAdvisor [47] explores the GNN input properties and proposes 2D workload management and specialized memory customization for system optimizations.  ... 
arXiv:2112.08541v1 fatcat:kzel63n3ircqdpcuie2d4jd7y4

BAG: Managing GPU as Buffer Cache in Operating Systems

Hao Chen, Jianhua Sun, Ligang He, Kenli Li, Huailiang Tan
2014 IEEE Transactions on Parallel and Distributed Systems  
Unlike previous uses of GPUs, which have focused on the computational capabilities of GPUs, BAG is designed to explore a new dimension in managing GPUs in heterogeneous systems where the GPU memory is  ...  With carefully designed data structures and algorithms, such as a concurrent hash table, a log-structured data store for the management of GPU memory, and highly parallel GPU kernels for garbage collection  ...  ACKNOWLEDGMENTS The authors are grateful to the anonymous reviewers for their helpful feedback. This research was supported in part by the National Natural Science  ... 
doi:10.1109/tpds.2013.201 fatcat:bwlhs7rbpbh2bdr76sg2soujwe
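The combination BAG relies on, an append-only log plus an index, with garbage collection compacting dead records, can be sketched briefly. This is a toy, host-side illustration of a log-structured store (BAG keeps the analogous structures in GPU memory and runs GC as parallel kernels); the class and its methods are invented for the example:

```python
class LogStore:
    """Toy log-structured store: writes append to a log, an index maps
    keys to log offsets, and gc() compacts away overwritten records."""
    def __init__(self):
        self.log = []     # append-only (key, value) records
        self.index = {}   # key -> offset of the live record

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

    def gc(self):
        """Keep only live records and rebuild the index."""
        live = [(k, v) for off, (k, v) in enumerate(self.log)
                if self.index.get(k) == off]
        self.log = live
        self.index = {k: i for i, (k, _) in enumerate(live)}

s = LogStore()
s.put("a", 1)
s.put("b", 2)
s.put("a", 3)   # overwrites "a"; the old record becomes garbage
s.gc()          # compaction drops the dead ("a", 1) record
```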

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10  
Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs.  ...  In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural  ...  Due to the fact that cache coherence is not available on today's GPUs, assuring memory consistency between two batches of constraints requires launching the second batch from the CPU host, which incurs  ... 
doi:10.1145/1815961.1816021 dblp:conf/isca/LeeKCDKNSSCHSD10 fatcat:7dgqdsykarcwhp22t7oxgawwza

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 SIGARCH Computer Architecture News  
Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs.  ...  In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural  ...  Due to the fact that cache coherence is not available on today's GPUs, assuring memory consistency between two batches of constraints requires launching the second batch from the CPU host, which incurs  ... 
doi:10.1145/1816038.1816021 fatcat:pxizpaiizrdq7gmsfs45obdqwy

Marius: Learning Massive Graph Embeddings on a Single Machine [article]

Jason Mohoney, Roger Waleffe, Yiheng Xu, Theodoros Rekatsinas, Shivaram Venkataraman
2021 arXiv   pre-print
550 GB of total parameters on a single machine with 16 GB of GPU memory and 64 GB of CPU memory.  ...  We propose Marius, a system for efficient training of graph embeddings that leverages partition caching and buffer-aware data orderings to minimize disk access and interleaves data movement with computation  ...  Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.  ... 
arXiv:2101.08358v2 fatcat:je66eltrgzbmnoilc3y3ai6vqa
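The partition caching Marius builds on can be sketched with a small LRU buffer that counts simulated disk loads. This illustrates only the caching half in spirit; Marius additionally chooses buffer-aware data orderings so that fewer such loads occur, and every name here is made up for the example:

```python
from collections import OrderedDict

class PartitionBuffer:
    """Toy LRU buffer over graph partitions, counting simulated disk loads."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # pid -> partition data, LRU order
        self.loads = 0

    def get(self, pid):
        if pid in self.cache:
            self.cache.move_to_end(pid)      # cache hit: refresh recency
        else:
            self.loads += 1                  # cache miss: simulated disk read
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least-recently used
            self.cache[pid] = f"partition-{pid}"
        return self.cache[pid]

buf = PartitionBuffer(capacity=2)
for pid in [0, 1, 0, 2, 0]:
    buf.get(pid)
# Accesses 0,1 miss; 0 hits; 2 misses (evicting 1); 0 hits -> 3 loads.
```

A training order that revisits cached partitions before evicting them (what Marius calls buffer-aware ordering) directly reduces the `loads` counter.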

Dynamic Task Parallelism with a GPU Work-Stealing Runtime System [chapter]

Sanjay Chatterjee, Max Grossman, Alina Sbîrlea, Vivek Sarkar
2013 Lecture Notes in Computer Science  
We introduce a finish-async style API for GPU device programming with the aim of executing irregular applications efficiently across the multiple streaming multiprocessors (SMs) in a GPU device without sacrificing  ...  The high number of computational cores and the high memory bandwidth supported by the device make it an ideal candidate for such applications.  ...  NVIDIA provides a C/C++ based API for programming their GPUs called the Compute Unified Device Architecture (CUDA) programming model.  ... 
doi:10.1007/978-3-642-36036-7_14 fatcat:ahl3s7fuqfej3bg7vov66hybsm
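The data structure at the heart of work-stealing runtimes like the one above is the owner/thief deque: the owner pushes and pops tasks at one end (LIFO, for locality), while idle workers steal from the other end (FIFO). A minimal single-threaded sketch, with no real synchronization and names invented for the example, not the paper's GPU implementation:

```python
import collections

class WorkStealingDeque:
    """Toy work-stealing deque: the owner works at the bottom,
    thieves steal from the top (no concurrency control here)."""
    def __init__(self):
        self._tasks = collections.deque()

    def push(self, task):
        self._tasks.append(task)        # owner end

    def pop(self):
        return self._tasks.pop() if self._tasks else None       # owner, LIFO

    def steal(self):
        return self._tasks.popleft() if self._tasks else None   # thief, FIFO

q = WorkStealingDeque()
for t in range(4):
    q.push(t)
owner = q.pop()     # owner takes the newest task (3)
stolen = q.steal()  # a thief takes the oldest task (0)
```

Stealing from the opposite end keeps owner and thieves mostly contention-free, which is what makes the scheme attractive across a GPU's streaming multiprocessors.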

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Leyuan Wang, Zhi Chen, Yizhi Liu, Yao Wang, Lianmin Zheng, Mu Li, Yida Wang
2019 Proceedings of the 48th International Conference on Parallel Processing - ICPP 2019  
The authors are also grateful to Frank Chen and Long Gao for providing devices for experiments, and Tianqi Chen for technical assistance. The entire work was done at AWS.  ...  ACKNOWLEDGMENT The authors thank the anonymous reviewers of the paper for valuable comments.  ...  on these GPUs and the existing unified optimization approach for these workloads.  ... 
doi:10.1145/3337821.3337839 dblp:conf/icpp/WangCLWZLW19 fatcat:ptvsneujwjdmhesvcrune7rqwy

D4.1 Programming Language And Runtime System: Requirements

Hans Vandierendonck
2016 Zenodo  
This document elaborates on the requirements for the VINEYARD programming model and runtime system.  ...  One of the components of VINEYARD is the programming model and runtime system support, which is developed in Work Package 4.  ...  This has an impact on memory management and scheduling. Memory management techniques must be aware that the memory space of a virtualized accelerator is shared between resources.  ... 
doi:10.5281/zenodo.898162 fatcat:h4qoibk26vfzdao5badtj6fdie
Showing results 1-15 of 185 results