15,291 Hits in 5.3 sec

The Hierarchical Memory Machine Model for GPUs

Koji Nakano
2013 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum  
The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory access and the global memory access of GPUs  ...  The main contribution of this paper is to introduce the Hierarchical Memory Machine (HMM), which consists of multiple DMMs and a single UMM.  ...  THE HIERARCHICAL MEMORY MACHINE MODEL (HMM) This section is devoted to presenting the Hierarchical Memory Machine Model (HMM), a more realistic parallel machine model that captures the architecture of GPUs  ... 
doi:10.1109/ipdpsw.2013.17 dblp:conf/ipps/Nakano13 fatcat:nz3gevfcgvaanby73654gsfpme

Designing a unified programming model for heterogeneous machines

Michael Garland, Manjunath Kudlur, Yili Zheng
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
We describe the design of the Phalanx programming model, which seeks to provide a unified programming model for heterogeneous machines.  ...  Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine.  ...  Its hierarchical organization of threads provides a natural scoping mechanism for shared resources, such as memory and barriers, and its hierarchical machine model allows programmers to control the placement  ... 
doi:10.1109/sc.2012.48 dblp:conf/sc/GarlandKZ12 fatcat:ua2yayf6svf5nbk7nlmlks267i

LightSeq: A High Performance Inference Library for Transformers [article]

Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li
2021 arXiv   pre-print
LightSeq includes a series of GPU optimization techniques to streamline the computation of neural layers and to reduce memory footprint.  ...  In this paper, we propose LightSeq, a highly efficient inference library for models in the Transformer family.  ...  Acknowledgments We would like to thank the colleagues in the machine translation service and advertisement service for supporting our experiments in online environments and applying LightSeq in real-time systems  ... 
arXiv:2010.13887v4 fatcat:ykvkvwampbcgjoiksd73c5p7oi

Snap ML: A Hierarchical Framework for Machine Learning [article]

Celestine Dünner, Thomas Parnell, Dimitrios Sarigiannis, Nikolas Ioannou, Andreea Anghel, Gummadi Ravi, Madhusudanan Kandasamy, Haralampos Pozidis
2018 arXiv   pre-print
We describe a new software framework for fast training of generalized linear models.  ...  The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing  ...  *Trademark, service mark, registered trademark of International Business Machines Corporation in the United States, other countries, or both. ** Intel Xeon is a trademark or registered trademark of Intel  ... 
arXiv:1803.06333v3 fatcat:l75n5irwuzcatgwrw5jhgvxro4

Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs [article]

Charlene Yang
2020 arXiv   pre-print
This paper surveys a range of methods to collect the necessary performance data on Intel CPUs and NVIDIA GPUs for hierarchical Roofline analysis.  ...  These tools will be used to collect information for as many memory/cache levels in the memory hierarchy as possible in order to provide insights into an application's data reuse and cache locality characteristics  ...  system, but this has been extended to the entire memory hierarchy in recent years, named the hierarchical Roofline model.  ... 
arXiv:2009.02449v4 fatcat:usmspjcwpvdgjlgjl6f2lya2i4

Accelerating Regular Expression Matching Using Hierarchical Parallel Machines on GPU

Cheng-Hung Lin, Chen-Hsiung Liu, Shih-Chieh Chang
2011 2011 IEEE Global Telecommunications Conference - GLOBECOM 2011  
In order to accelerate regular expression matching and resolve the problem of state explosion, we propose a GPU-based approach which applies hierarchical parallel machines to quickly recognize suspicious  ...  However, the expressive power of regular expressions comes with intensive computation and memory consumption, which lead to severe performance degradation.  ...  HIERARCHICAL PARALLEL MACHINES In this section, we propose a GPU-based parallel approach which applies hierarchical state machines to accelerate the matching process of the two complex regular expressions  ... 
doi:10.1109/glocom.2011.6133663 dblp:conf/globecom/LinLC11 fatcat:2s3l4mpfqrhdtashka45j2go3q

Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [article]

Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, Ping Li
2020 arXiv   pre-print
Deep learning models in online advertising industries can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory on a computing node.  ...  We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage.  ...  Another challenge is the big model size for GPU platforms. When the model becomes bigger, the limited GPU memory cannot hold the entire model.  ... 
arXiv:2003.05622v1 fatcat:kfl2uv7oarfsfa7zpkgps76h6e

Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement [chapter]

Yonghong Yan, Jisheng Zhao, Yi Guo, Vivek Sarkar
2010 Lecture Notes in Computer Science  
In this paper, we introduce the hierarchical place tree (HPT) model as a portable abstraction for task parallelism and data movement.  ...  Preliminary results on general-purpose multicore processors and GPU accelerators indicate that the HPT model can be a promising portable abstraction for future multicore processors.  ...  Finally, we would like to thank the anonymous reviewers for their comments and suggestions, which helped improve the overall presentation of the paper.  ... 
doi:10.1007/978-3-642-13374-9_12 fatcat:j57qe2fafrbydk45l43i6lcd6u

A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent [article]

Wanjing Wei, Yangzihao Wang, Pin Gao, Shijie Sun, Donghai Yu
2021 arXiv   pre-print
We propose a hierarchical data partitioning strategy and an embedding training pipeline to optimize both communication and memory usage on a GPU cluster.  ...  Comparing with the current state-of-the-art multi-GPU single-node embedding system, our system achieves 5.9x-14.4x speedup on average with competitive or better accuracy on open datasets.  ...  We also thank Stanley Tzeng for his proofreading of the manuscript.  ... 
arXiv:2005.13789v3 fatcat:e7c7u6zpmzf23hjk6jjny2bora

Hierarchical Roofline Performance Analysis for Deep Learning Applications [article]

Charlene Yang, Yunsong Wang, Steven Farrell, Thorsten Kurth, Samuel Williams
2020 arXiv   pre-print
This methodology allows for automated machine characterization and application characterization for Roofline analysis across the entire memory hierarchy on NVIDIA GPUs, and it is validated by a complex  ...  This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs.  ...  These two components together comprise the complete data collection methodology for machine and application characterization in a hierarchical Roofline analysis on NVIDIA GPUs. A.  ... 
arXiv:2009.05257v4 fatcat:4lus2wltafg77mt54a6wzc7gp4

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Hua He, Jimmy Lin, Adam Lopez
2015 Transactions of the Association for Computational Linguistics  
do not work for hierarchical models, which require matching patterns that contain gaps.  ...  We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.  ...  We also thank UMIACS for providing hardware resources via the NVIDIA CUDA Center of Excellence, and the UMIACS IT staff, especially Joe Webster, for excellent support.  ... 
doi:10.1162/tacl_a_00124 fatcat:kjjresiiwjhsjhx2c6ql3zusrm

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training [article]

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy
2018 arXiv   pre-print
PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks.  ...  PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.  ...  Conclusion We found that inefficient PS software architecture and network environment-induced overhead were the major bottlenecks of distributed training with modern GPUs in the cloud, making DDNN training  ... 
arXiv:1805.07891v1 fatcat:jrur6u3vjfgrxpfi6lialuhoru

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks [article]

Charlene Yang
2020 arXiv   pre-print
An array of techniques used to analyze this OpenACC kernel and optimize its performance are shown, including the use of the hierarchical Roofline performance model and the performance tool Nsight Compute.  ...  on an NVIDIA V100 GPU, with 8 optimization steps.  ...  The hierarchical Roofline model [3] looks at data transactions between each pair of memory/cache levels, and on NVIDIA GPUs, we particularly focus on data transactions between these three levels, device  ... 
arXiv:2008.11326v4 fatcat:kwwir5eiwfaapigxuqxdepbaui
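
The hierarchical Roofline model described in the entries above reduces, per memory level, to a simple bound: attainable performance is the minimum of the compute peak and bandwidth times arithmetic intensity. A minimal sketch of that bound, with hypothetical (not measured) ceilings and intensities:

```python
# Illustrative hierarchical Roofline bound; all numbers are hypothetical
# placeholders, not measurements from the papers listed here.

def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Roofline: a kernel is bound either by compute peak or by the
    memory traffic at a given level (bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

peak = 7800.0  # GFLOP/s, illustrative compute ceiling

# Hypothetical bandwidth ceilings (GB/s) for three levels of the hierarchy,
# and a kernel's arithmetic intensity (FLOPs/byte) measured at each level.
levels = {"L1": 14000.0, "L2": 2100.0, "HBM": 900.0}
ai = {"L1": 0.5, "L2": 2.0, "HBM": 10.0}

for level, bw in levels.items():
    print(f"{level}: bound at {attainable_gflops(peak, bw, ai[level]):.0f} GFLOP/s")
```

The kernel's overall ceiling is the tightest of the per-level bounds, which is what the hierarchical analysis in these papers visualizes.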

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

2016 Supercomputing Frontiers and Innovations  
Of interest is the evolution of the programming models for DLA libraries -in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore  ...  We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures.  ...  This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission  ... 
doi:10.14529/jsfi150405 fatcat:avnmwu4dozdmjksknrlznhpv7u

The Approximate String Matching on the Hierarchical Memory Machine, with Performance Evaluation

Duhu Man, Koji Nakano, Yasuaki Ito
2013 2013 IEEE 7th International Symposium on Embedded Multicore Socs  
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs.  ...  The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and to implement it on a CUDA-enabled GPU.  ...  MEMORY MACHINE MODELS: THE DMM, THE UMM, AND THE HMM We first define the Discrete Memory Machine (DMM) of width w and latency l. Let m[i] (i ≥ 0) denote a memory cell of address i in the memory.  ... 
doi:10.1109/mcsoc.2013.22 fatcat:iq6vii245ncdjkawh5hvkro3ge
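
In the DMM definition quoted above, the memory of width w is partitioned into w banks, with cell m[i] residing in bank i mod w; w simultaneous accesses proceed without serialization only if they hit distinct banks. A small sketch of that bank-conflict rule (names are ours, chosen for illustration):

```python
# Sketch of the Discrete Memory Machine (DMM) bank model: memory cell
# m[i] lives in bank i mod w. A group of w simultaneous accesses is
# conflict-free iff the addresses map to w distinct banks.

def bank_of(address, w):
    """Bank holding memory cell m[address] in a DMM of width w."""
    return address % w

def conflict_free(addresses, w):
    """True if the simultaneous accesses hit pairwise-distinct banks."""
    banks = [bank_of(a, w) for a in addresses]
    return len(set(banks)) == len(banks)

# Stride-1 access across w=4 banks: distinct banks, no serialization.
print(conflict_free([0, 1, 2, 3], 4))
# Stride-4 access: every address lands in bank 0, worst-case conflict.
print(conflict_free([0, 4, 8, 12], 4))
```

This is the same structure as shared-memory bank conflicts on CUDA GPUs, which is exactly the hardware behavior the DMM is meant to abstract.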