38 Hits in 5.9 sec

Performance Evaluation of Advanced Features in CUDA Unified Memory [article]

Steven W. D. Chien, Ivy B. Peng, Stefano Markidis
2019 arXiv   pre-print
CUDA Unified Memory improves GPU programmability and also enables GPU memory oversubscription.  ...  However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to the basic CUDA Unified Memory.  ...  Currently, modern CPUs support 48-bit memory addresses while Unified Memory uses 49-bit virtual addressing, which can address both host and GPU memories [14].  ...
arXiv:1910.09598v1 fatcat:u7ik3kfsirdglbzzl6djd3kyti
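The "memory advises" that this abstract compares against basic Unified Memory are the cudaMemAdvise hints of the CUDA runtime. A minimal sketch of how such hints are typically applied to a managed allocation follows; it is not taken from the paper, and the allocation size, device ID, and choice of advice flags are illustrative assumptions.

#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    size_t bytes = 1ull << 30;            // 1 GiB of managed (unified) memory
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);      // single pointer valid on host and GPU

    // Advise the driver that the region is mostly read by the GPU and should
    // preferably reside in GPU memory; such hints are what the paper evaluates
    // against plain managed allocations under oversubscription.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

    // Optional prefetch so the first kernel does not pay demand page faults.
    cudaMemPrefetchAsync(data, bytes, device);

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}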

Traversing large graphs on GPUs with unified memory

Prasun Gera, Hyojong Kim, Piyush Sao, Hyesoon Kim, David Bader
2020 Proceedings of the VLDB Endowment  
Recent hardware and software advances make it possible to address much larger host memory transparently as a part of a feature known as unified virtual memory.  ...  Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory.  ...  Government, and no official endorsement should be inferred. The U.S.  ... 
doi:10.14778/3384345.3384358 fatcat:ormw2g7v7jf5rdpnz2gm6ewnya
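Unified (managed) memory is what lets a graph larger than device memory be traversed transparently: the driver migrates pages on demand as a kernel touches them. Below is a hypothetical sketch under that assumption, not the authors' code; the vertex count and the trivial CSR offsets are placeholders.

#include <cuda_runtime.h>

__global__ void degree_kernel(const long long *row_ptr, long long n, long long *deg) {
    long long v = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (v < n) deg[v] = row_ptr[v + 1] - row_ptr[v];    // out-degree from CSR offsets
}

int main() {
    long long n = 1LL << 28;                            // placeholder vertex count
    long long *row_ptr, *deg;
    // Managed allocations may exceed GPU capacity; pages are migrated in from
    // host memory on demand as the kernel accesses them.
    cudaMallocManaged(&row_ptr, (n + 1) * sizeof(long long));
    cudaMallocManaged(&deg, n * sizeof(long long));
    for (long long v = 0; v <= n; ++v) row_ptr[v] = v;  // trivial placeholder offsets

    degree_kernel<<<(unsigned)((n + 255) / 256), 256>>>(row_ptr, n, deg);
    cudaDeviceSynchronize();

    cudaFree(row_ptr);
    cudaFree(deg);
    return 0;
}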

Architectural and Operating System Support for Virtual Memory

Abhishek Bhattacharjee, Daniel Lustig
2017 Synthesis Lectures on Computer Architecture  
Thank you also to Trey Cain, Derek Hower, Lisa Hsu, Aamer Jaleel, Yatin Manerkar, Michael Pellauer, and Caroline Trippel for the countless helpful discussions about virtual memory and memory system behavior  ...  We also thank the many collaborators with whom we have explored various topics pertaining to virtual memory.  ...  Split TLBs store translation information for instruction and data memory separately, while unified TLBs store both together.  ... 
doi:10.2200/s00795ed1v01y201708cac042 fatcat:4re5afn53jhu7ezxwtb25ja3ca

Enhancing Programmability, Portability, and Performance with Rich Cross-Layer Abstractions [article]

Nandita Vijaykumar
2019 arXiv   pre-print
, and performance in CPUs and GPUs.  ...  Programmability, performance portability, and resource efficiency have emerged as critical challenges in harnessing complex and diverse architectures today to obtain high performance and energy efficiency  ...  Modern systems employ a large variety of components to optimize memory performance (e.g., prefetchers, caches, memory controllers). The semantic gap has two important implications: Implication 1.  ... 
arXiv:1911.05660v1 fatcat:w5f3g4isqbcphm2jjfzjtvrjnq

Memory leads the way to better computing

H.-S. Philip Wong, Sayeef Salahuddin
2015 Nature Nanotechnology  
There was, however, virtually complete agreement about the key challenges that surfaced from the study, and the potential value that solving them may have towards advancing the field of high performance  ...  The goal of the study was to assay the state of the art, and not to either propose a potential system or prepare and propose a detailed roadmap for its development.  ...  While Catamount supports virtual addressing, it does not support virtual memory.  ...
doi:10.1038/nnano.2015.29 pmid:25740127 fatcat:d6iiuuwcozbxlgn4kxxzdzwd4m

Dynamic memory management for the efficient utilization of graphics processing units in interactive machine learning development [article]

Georgios Alexopoulos, National Technological University Of Athens
2022
Figure 8.15: Unified Virtual Addressing. Unified Memory: an exception to the 1-to-1 relationship between physical and virtual device memory pages is Unified Memory Allocations  ...  However, when memory is oversubscribed, which Unified Memory makes possible, page faults can occur.  ...
doi:10.26240/heal.ntua.21988 fatcat:3gr6hgyejzf4rogwdq6xag6diq

Transparent Memory Extension for Shared GPUs

Jens Kehne
2019
This swapping out is, however, complicated by the asynchronous operation of current GPUs: applications can submit GPU kernels directly to the GPU for execution without having to call the operating system to  ...  The operating system therefore has no control over when GPU kernels are executed. Moreover, current GPUs assume that all graphics memory that has once been [...] by a  ...  The memory of modern GPUs can be oversubscribed easily since these GPUs support virtual memory not unlike that found in CPUs.  ...
doi:10.5445/ir/1000090871 fatcat:2wmuqtavo5gydfzjxyr6ik3yz4

PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units [article]

Yujeong Choi, Minsoo Rhu
2019 arXiv   pre-print
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests.  ...  We show that preemptive NPU multi-tasking can achieve an average 7.8x, 1.4x, and 4.8x improvement in latency, throughput, and SLA satisfaction, respectively.  ...  If the multiple checkpointed states oversubscribe NPU memory, the approach taken by Rhu et al.  ...
arXiv:1909.04548v1 fatcat:mwsbnwmt6bcpxnozcjp56gjhtm

High-Performance and Scalable GPU Graph Traversal

Duane Merrill, Michael Garland, Andrew Grimshaw
2015 ACM Transactions on Parallel Computing  
It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent.  ...  This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.  ...  Multi-GPU parallelization: communication between GPUs is simplified by a unified virtual address space in which pointers can transparently reference data residing within remote GPUs.  ...
doi:10.1145/2717511 fatcat:yspiacetirfpjnbrmungchwyue
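The unified virtual address space mentioned here is what allows a kernel running on one GPU to dereference a pointer allocated on another, once peer access is enabled. The following is an illustrative sketch rather than the paper's implementation; the device IDs and buffer size are assumptions, peer-capable hardware is required, and error checking is omitted.

#include <cuda_runtime.h>

__global__ void copy_remote(const int *src, int *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];   // src may physically reside on another GPU
}

int main() {
    int n = 1 << 20;
    int *buf0, *buf1;

    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(int));    // allocated on GPU 1

    cudaSetDevice(0);
    cudaMalloc(&buf0, n * sizeof(int));    // allocated on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);      // let GPU 0 access GPU 1's memory

    // Kernel runs on GPU 0 and reads buf1 directly through the unified address space.
    copy_remote<<<(n + 255) / 256, 256>>>(buf1, buf0, n);
    cudaDeviceSynchronize();

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}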

Cost- and QoS-Efficient Serverless Cloud Computing [article]

Chavit Denninnart
2020 arXiv   pre-print
We determined the minimum chance-of-success thresholds that tasks must pass to be scheduled and executed.  ...  We dynamically adjust such thresholds based on multiple characteristics of the arriving workload and the system's conditions.  ...  Amazon cloud [Ama18] offers inconsistent heterogeneity in the form of various Virtual Machine (VM) types, such as CPU-Optimized, Memory-Optimized, Disk-Optimized, and Accelerated Computing (GPU and FPGA  ...
arXiv:2011.11711v1 fatcat:ra66iigninem7gqibmivsn3dba

Trace-Based Performance Analysis for Hardware Accelerators [chapter]

Guido Juckeland
2012 Tools for High Performance Computing 2011  
After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation  ...  High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar  ...  Thus, a pointer intended for use on the GPU can very well point to a valid memory location on the host, i.e., hold the same virtual address, and incorrect use of a GPU pointer can result in chaos.  ...
doi:10.1007/978-3-642-31476-6_8 dblp:conf/ptw/Juckeland11 fatcat:4bmyaxswwrfldbt6gk52lhij7e
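With today's unified virtual address space, the ambiguity described above can at least be detected at run time, since the CUDA runtime can report whether a pointer refers to host, device, or managed memory. A small illustrative sketch, assuming CUDA 10 or newer for the cudaPointerAttributes::type field:

#include <cuda_runtime.h>
#include <cstdio>

// Print where a pointer's backing memory lives according to the CUDA runtime.
static void classify(const void *p, const char *label) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        printf("%s: not known to the CUDA runtime\n", label);
        return;
    }
    const char *kind =
        attr.type == cudaMemoryTypeDevice  ? "device" :
        attr.type == cudaMemoryTypeHost    ? "pinned host" :
        attr.type == cudaMemoryTypeManaged ? "managed" : "unregistered host";
    printf("%s: %s memory\n", label, kind);
}

int main() {
    int host_buf[4];
    int *dev_buf = nullptr;
    cudaMalloc(&dev_buf, sizeof(int) * 4);

    classify(host_buf, "host_buf");   // plain host allocation
    classify(dev_buf, "dev_buf");     // device allocation

    cudaFree(dev_buf);
    return 0;
}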

A Survey of Big Data Machine Learning Applications Optimization in Cloud Data Centers and Networks [article]

Sanaa Hamid Mohamed, Taisir E.H. El-Gorashi, Jaafar M.H. Elmirghani
2019 arXiv   pre-print
as virtualization, and software-defined networking that increasingly support big data systems.  ...  Moreover, we provide a brief review of data centers topologies, routing protocols, and traffic characteristics, and emphasize the implications of big data on such cloud data centers and their supporting  ...  This work was supported by the Engineering and Physical Sciences Research Council, INTERNET (EP/H040536/1), STAR (EP/K016873/1) and TOWS (EP/S016570/1) projects.  ... 
arXiv:1910.00731v1 fatcat:kvi3br4iwzg3bi7fifpgyly7m4

D7.3: Inventory of Exascale Tools and Techniques

Nicola Mc Donnell
2016 Zenodo  
In Section 2, we summarise our findings separately by topic: programming interfaces and standards, debuggers and profilers, scalable libraries and algorithms and I/O management techniques, European Exascale  ...  analysis and exploitation phase.  ...  Acknowledgements The authors would like to acknowledge and thank the Centres of Excellence for their cooperation with and contribution to this deliverable.  ... 
doi:10.5281/zenodo.6801725 fatcat:ez63t2znsvdcpnvijzi4c74dc4

D8.3.2: Final technical report and architecture proposal

Ramnath Sai Sagar, Jesus Labarta, Aad van der Steen, Iris Christadler, Herbert Huber
2010 Zenodo  
The document also suggests potential architectures for future machines, the level of performance we should expect and areas where research efforts should be dedicated.  ...  This document describes the activities in Work Package 8 Task 8.3 (WP8.3) updating and analysing results reported in D8.3.1 for the different WP8 prototypes.  ...  For larger matrices, the best performance is achieved when using CUBLAS and a full set of 4 processes, thus oversubscribing the GPUs with 2 tasks each.  ... 
doi:10.5281/zenodo.6546134 fatcat:35eigjqrzvb3vfd3pjud2oswtu

Portable, predictable and partitionable: a domain specific approach to heterogeneous computing

Gordon Inggs, David Thomas, Wayne Luk, Oppenheimer Memorial Trust, National Research Foundation (South Africa)
2016
Beyond Central Processing Units (CPUs), different architectures such as massively parallel Graphics Processing Units (GPUs) and reconfigurable Field Programmable Gate Arrays (FPGAs) are seeing widespread  ...  and FPGA platforms from many different vendors.  ...  There is a close analogy between language virtualization and hardware virtualization using virtual machines.  ... 
doi:10.25560/31595 fatcat:hljcrhx4hjbqxkavtrkui5dgsq
Showing results 1 — 15 out of 38 results