
Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs

Martin Kong, Antoniu Pop, Louis-Noël Pouchet, R. Govindarajan, Albert Cohen, P. Sadayappan
2015 ACM Transactions on Architecture and Code Optimization (TACO)  
...performance depends upon program structure and applied transformations; some transformations could result in a loss of locality or parallelism  ...  kernel performance in GFLOPS/sec for AMD Opteron 6274 (16 cores)  ...  [tiled OpenMP code excerpt: #pragma omp parallel for private(lbv,ubv) for (t2=lbp;t2<=ubp;t2++) ... for (t3=0;t3<=floord(_PB_N-2,32);t3++) for (t4=max(1,32*t2);t4<=min(_PB_N- ...]  ... 
doi:10.1145/2687652 fatcat:bnfyp322v5dlffowuyqu7yfire


Transmuter: Bridging the Efficiency Gap Using Memory and Dataflow Reconfiguration

Subhankar Pal, Siying Feng, Dong-hyeon Park, Sung Kim, Aporva Amarnath, Chi-Sheng Yang, Xin He, Jonathan Beaumont, Kyle May, Yan Xiong, Kuba Kaszyk, John Magnus Morton (+8 others)
2020 Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques  
Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications  ...  programmable for end users.  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their helpful feedback.  ... 
doi:10.1145/3410463.3414627 dblp:conf/IEEEpact/PalFPKAYHBMXKMS20 fatcat:kwsaun2g65b6jl6mdqrhgiv7yq

Policy-based tuning for performance portability and library co-optimization

Duane Merrill, Michael Garland, Andrew Grimshaw
2012 2012 Innovative Parallel Computing (InPar)  
To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture  ...  In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor  ...  The twin burdens of expression and mapping have historically fallen separately upon the shoulders of the programmer and the compiler/runtime, respectively.  ... 
doi:10.1109/inpar.2012.6339597 fatcat:pvmge5vmbfaghcmnuc6bgesrzq


Dominik Adamski, Grzegorz Jabłoński
2017 Computer Science  
Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs.  ...  not by data tiling but by dynamic dataflow parallelization [12].  ... 
doi:10.7494/csci.2017.18.2.145 fatcat:vz2p74icxfg4vnciy2imq4aimi

Automap: Towards Ergonomic Automated Parallelism for ML Models [article]

Michael Schaarschmidt and Dominik Grewe and Dimitrios Vytiniotis and Adam Paszke and Georg Stefan Schmid and Tamara Norman and James Molloy and Jonathan Godwin and Norman Alexander Rink and Vinod Nair and Dan Belov
2021 arXiv   pre-print
The rapid rise in demand for training large neural network architectures has brought into focus the need for partitioning strategies, for example by using data, model, or pipeline parallelism.  ...  Through a combination of inductive tactics and search in a platform-independent partitioning IR, automap can recover expert partitioning strategies such as Megatron sharding for transformer layers.  ...  Optimising data transfers and reasoning about cost happens at this level of the stack, before we eventually compile back to accelerator-specific HLO code and feed back into the XLA compiler/runtime.  ... 
arXiv:2112.02958v1 fatcat:tlda37oxgjeezggohojvh4sdni

The Dinamica Virtual Machine for Geosciences [chapter]

Bruno Morais Ferreira, Britaldo Silveira Soares-Filho, Fernando Magno Quintão Pereira
2015 Lecture Notes in Computer Science  
Dinamica EGO is a framework used in the development of geomodeling applications. Behind its multitude of visual modes and graphic elements, Dinamica EGO runs on top of a virtual machine.  ...  Dinamica's runtime addresses this challenge through a suite of optimizations, which borrows ideas from functional programming languages, and leverages specific behavior expected in geo-scientific programs  ...  shows the number of dynamic copies in the unoptimized program.  ... 
doi:10.1007/978-3-319-24012-1_4 fatcat:7aih24t5ibeulftqnpzxranrby

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results [article]

Navdeep Katel, Vivek Khandelwal, Uday Bondhugula
2021 arXiv   pre-print
On a set of problem sizes we evaluated, initial performance results show that we are able to attain performance that is 95-119% and 80-160% of CuBLAS for FP32 and FP16 accumulate respectively on NVIDIA's  ...  We believe that these results could be used as motivation for further research and development on automatic code and library generation using IR infrastructure for similar specialized accelerators.  ...  We especially thank the developers of Affine, GPU, and LLVM dialects for providing useful infrastructure that was used extensively in this work.  ... 
arXiv:2108.13191v1 fatcat:2wmqo24fgnhtjpeuystsnbaa7y

OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators

Subhankar Pal, Swagath Venkataramani, Viji Srinivasan, Kailash Gopalakrishnan
2022 ACM Transactions on Embedded Computing Systems  
runtime of a DL accelerator.  ...  Prior software optimizations start with an input graph and focus on node-level optimizations, viz. dataflows and hierarchical tiling, and graph-level optimizations such as operation fusion.  ...  Similar to other software stack extensions added to DL frameworks [93] , we integrated OnSRAM-Static and OnSRAM-Eager management within the graph compiler runtime of TensorFlow [3] .  ... 
doi:10.1145/3530909 fatcat:3v5q5adb5vf2dg2eqtw4r3dwiy

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [article]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
2018 arXiv   pre-print
Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs.  ...  It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.  ...  program), Oracle, Huawei and anonymous sources.  ... 
arXiv:1802.04799v3 fatcat:e6htzyqaqjhpnm3yyi6xl3mdoq


SKMD: Single Kernel Multiple Devices for Transparent CPU-GPU Collaboration

Janghaeng Lee, Mehrzad Samadi, Yongjun Park, Scott Mahlke
2015 ACM Transactions on Computer Systems  
Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data parallel work by taking advantage of its massive number of cores while the CPU handles  ...  In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric  ...  Dandelion [Rossbach et al. 2013 ] also proposes a compiler/runtime framework that takes C# sources with newer APIs, and converts them to CUDA code, and runtime manages execution between CPUs and GPUs  ... 
doi:10.1145/2798725 fatcat:jf25wwgehzbtleikqi66gkay2u

A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators

Yohann Uguen, Florent de Dinechin, Steven Derrien
2017 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)  
The present work lifts this restriction: it is a case study of enhancing an HLS design flow with non-standard operators, which can then be automatically optimized for their application context.  ...  A high-level synthesis approach optimizing accumulations in floating-point programs using custom formats and operators.  ...  Such patterns are therefore exposed to the compiler/runtime either by the user through directives, or automatically identified using static analysis techniques [17] , [7] .  ... 
doi:10.1109/fccm.2017.41 dblp:conf/fccm/UguenDD17 fatcat:44ostrl6m5dotpdngqq4b2rjjq

Tutorial: Open-Source EDA and Machine Learning for IC Design: A Live Update

Abdelrahman Hosny, Andrew B. Kahng
2020 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID)  
For academic researchers, it speeds up the lifecycle of scientific progress and makes research results relevant to modern industry practice.  ...  He is currently PI of "OpenROAD", a $12.4M U.S. DARPA project targeting open-source, autonomous ("no human in the loop") tools for IC implementation.  ...  We will introduce the technology trends driving heterogeneity, provide an overview of computationally divergent and performance heterogeneous multi-cores, and present compiler, runtime support to fully  ... 
doi:10.1109/vlsid49098.2020.00016 dblp:conf/vlsid/HosnyK20 fatcat:gsvvnrgbr5dpdjwkkx63jnf2f4

Software challenges in extreme scale systems

Vivek Sarkar, William Harrod, Allan E Snavely
2009 Journal of Physics, Conference Series  
His research areas include advanced VLSI and nanotechnologies, non-von Neumann models of programming and execution, parallel algorithms and applications, and their impact on massively parallel computer  ...  Carlson is a member of the research staff at the IDA Center for Computing Sciences where, since 1990, his focus has been on applications and system tools for large-scale parallel and distributed computers  ...  To address this challenge, the compiler/runtime system must be very precise and accurate as to the major reasons for the lack of scaling.  ... 
doi:10.1088/1742-6596/180/1/012045 fatcat:iukutry2dvbitfdh6ng7kgz564

Machine Learning in Compiler Optimization

Zheng Wang, Michael O'Boyle
2018 Proceedings of the IEEE  
We then provide a comprehensive survey and a road map for the wide variety of different research areas.  ...  This paper provides both an accessible introduction to the fast-moving area of machine-learning-based compilation and a detailed bibliography of its main achievements.  ...  For instance, is it possible to learn dataflow or points-to analysis?  ... 
doi:10.1109/jproc.2018.2817118 fatcat:vuebhfw7efcdpm5yyzaumia3vi

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
2022 ACM Transactions on Design Automation of Electronic Systems  
To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization.  ...  Internet of Things (IoT) devices.  ...  MCUNet [175] co-designs the efficient neural architecture and efficient compiler/runtime to enable ImageNet-scale applications on off-the-shelf microcontrollers.  ... 
doi:10.1145/3486618 fatcat:h6xwv2slo5eklift2fl24usine
Showing results 1 — 15 out of 25 results