Dynamic Automatic Differentiation of GPU Broadcast Kernels [article]

Jarrett Revels, Tim Besard, Valentin Churavy, Bjorn De Sutter, Juan Pablo Vielma
2018 arXiv   pre-print
We show how forward-mode automatic differentiation (AD) can be employed within larger reverse-mode computations to dynamically differentiate broadcast operations in a GPU-friendly manner.  ...  We discuss an experiment in which a Julia implementation of our technique outperformed pure reverse-mode TensorFlow and Julia implementations for differentiating through broadcast operations within an  ...  The authors would also like to thank James Bradbury, Peter Ahrens, Mike Innes, Deniz Yuret, Ekin Akyürek and Simon Danisch for multiple helpful conversations and contributions to Julia's AD and GPU ecosystems  ... 
arXiv:1810.08297v3 fatcat:qmcpwffhwbg23esvipeohnvvwu
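
The core trick is compact enough to sketch. Below is a minimal CUDA illustration (not the authors' Julia implementation) of forward-mode dual numbers evaluated inside a fused elementwise broadcast kernel, so a single pass produces both the primal values and the partials that an enclosing reverse-mode sweep can consume; the kernel name and the sin(x1 .* x2) example are assumptions for illustration.

    // Dual number: primal value plus one derivative component.
    struct Dual { float val; float eps; };

    __device__ Dual dmul(Dual a, Dual b) {
        return { a.val * b.val, a.val * b.eps + a.eps * b.val };
    }

    __device__ Dual dsin(Dual a) {
        return { sinf(a.val), cosf(a.val) * a.eps };
    }

    // Computes y = sin(x1 .* x2) elementwise and, in the same launch,
    // dy/dx1 by seeding a dual perturbation on x1.
    __global__ void broadcast_sin_mul(const float* x1, const float* x2,
                                      float* y, float* dydx1, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            Dual a = { x1[i], 1.0f };   // d(x1)/d(x1) = 1
            Dual b = { x2[i], 0.0f };   // d(x2)/d(x1) = 0
            Dual r = dsin(dmul(a, b));
            y[i]     = r.val;
            dydx1[i] = r.eps;
        }
    }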

Message passing on data-parallel architectures

Jeff A. Stuart, John D. Owens
2009 2009 IEEE International Symposium on Parallel & Distributed Processing  
As a case study, we design and implement "DCGN", an MPI-like API for NVIDIA GPUs that allows full access to the underlying architecture.  ...  By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application  ...  Developers are responsible for kernels, and they write for the GPU and CPU as desired. DCGN will not automatically convert a CPU kernel to a GPU kernel. Kernels are launched via calls to DCGN.  ... 
doi:10.1109/ipdps.2009.5161065 dblp:conf/ipps/StuartO09 fatcat:igt36w4vvbhpldhrp5ireolu5a
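
The snippet above does not show DCGN's actual API, so the following is a hypothetical CUDA sketch of the global-memory mailbox pattern that device-side message passing of this kind builds on; all names and signatures are illustrative, not DCGN's.

    // One mailbox slot per receiving rank, allocated in global memory.
    struct Mailbox { int flag; float payload; };

    __device__ void send_msg(Mailbox* box, int dst, float value) {
        box[dst].payload = value;
        __threadfence();                 // publish the payload before the flag
        atomicExch(&box[dst].flag, 1);   // signal that a message is ready
    }

    __device__ float recv_msg(Mailbox* box, int self) {
        while (atomicAdd(&box[self].flag, 0) == 0) { }  // spin until signaled
        __threadfence();                 // order the flag read before the load
        float v = box[self].payload;
        box[self].flag = 0;              // release the slot
        return v;
    }
    // Caveat: spinning across thread blocks can deadlock unless the
    // communicating blocks are co-resident on the GPU, which is one reason
    // a runtime of this kind must manage communication and scheduling itself.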

Fashionable Modelling with Flux [article]

Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi, Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali, Avik Pal, Viral Shah
2018 arXiv   pre-print
We detail the fundamental principles of Flux as a framework for differentiable programming, give examples of models implemented within Flux that display many of the language- and framework-level  ...  an overview of the larger ecosystem that Flux fits into.  ...  Automatic Batching: The naturally data-parallel structure of many ML models makes them an excellent fit for massively parallel processors such as GPUs [23] and TPUs.  ... 
arXiv:1811.01457v3 fatcat:7qrera5ydzcwlfvwqxren7sk2q

Dynamic control flow in large-scale machine learning

Yuan Yu, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, Xiaoqiang Zheng, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis (+3 others)
2018 Proceedings of the Thirteenth EuroSys Conference on - EuroSys '18  
First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.  ...  Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow.  ...  the many users of TensorFlow.  ... 
doi:10.1145/3190508.3190551 dblp:conf/eurosys/YuABBBDDGHHIKMM18 fatcat:5u4gcsi5fba33mv2nyni32h424

CUDA-NP

Yi Yang, Huiyang Zhou
2014 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14  
For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels.  ...  However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs.  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1145/2555243.2555254 dblp:conf/ppopp/YangZ14 fatcat:ts2yttcyzndrliod625rkzjyty
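
To make the idea concrete, here is a hand-written CUDA sketch of the transformation that, per the abstract, the CUDA-NP compiler automates: a sequential loop inside each thread is spread over extra "slave" threads in the y-dimension. The pragma comment stands in for the OpenMP-like annotation and is not the paper's exact syntax; the block shape (128, 4) is an assumption.

    // Before: each thread reduces its own row sequentially.
    // Launched as before_np<<<grid, 128>>>(...).
    __global__ void before_np(float* out, const float* in, int m) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // #pragma parallel-for  (marks the nested loop as parallelizable)
        for (int j = 0; j < m; ++j)
            acc += in[tid * m + j];
        out[tid] = acc;
    }

    // After: the loop is strip-mined across blockDim.y slave threads.
    // Launched as after_np<<<grid, dim3(128, 4)>>>(...).
    __global__ void after_np(float* out, const float* in, int m) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        __shared__ float partial[128][4];   // assumes block shape (128, 4)
        float acc = 0.0f;
        for (int j = threadIdx.y; j < m; j += blockDim.y)
            acc += in[tid * m + j];
        partial[threadIdx.x][threadIdx.y] = acc;
        __syncthreads();
        if (threadIdx.y == 0) {             // master thread combines partials
            float sum = 0.0f;
            for (int k = 0; k < blockDim.y; ++k)
                sum += partial[threadIdx.x][k];
            out[tid] = sum;
        }
    }

The extra parallelism stays inside one kernel launch, avoiding the global-memory communication and launch overhead that the abstract attributes to dynamic parallelism.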

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Yi Yang, Chao Li, Huiyang Zhou
2015 Journal of Computer Science and Technology  
For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels.  ...  However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs.  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1007/s11390-015-1500-y fatcat:42baxdr3hbf6tnxgo25ycfp2c4

CUDA-NP

Yi Yang, Huiyang Zhou
2014 SIGPLAN notices  
For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels.  ...  However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs.  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1145/2692916.2555254 fatcat:kxedfqo55fgrdoxghfjjbi27tu

Effective Extensible Programming: Unleashing Julia on GPUs

Tim Besard, Christophe Foket, Bjorn De Sutter
2019 IEEE Transactions on Parallel and Distributed Systems  
Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming.  ...  We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language.  ...  ACKNOWLEDGMENTS: This work is supported by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT Vlaanderen), and by Ghent University through the Concerted Research Action  ... 
doi:10.1109/tpds.2018.2872064 fatcat:ev2zxoetcjg5zixthc5cmiwecq
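
For reference, a vector-add kernel of the kind this paper compiles from high-level Julia looks like the following in CUDA C; the Julia toolchain described here emits comparable PTX-level code from a few lines of idiomatic Julia (the Julia source itself is not reproduced in the snippet above).

    // CUDA C counterpart of a minimal high-level GPU kernel.
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }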

Rapid software prototyping for heterogeneous and distributed platforms

Tim Besard, Valentin Churavy, Alan Edelman, Bjorn De Sutter
2019 Advances in Engineering Software  
With the continued stagnation of single-threaded performance, using hardware accelerators such as GPUs or FPGAs is necessary.  ...  In this model, programs are generically typed, the location of the data is encoded in the type system, and multiple dispatch is used to select functionality based on the type of the data.  ...  Tim Besard is supported by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Research Foundation - Flanders (FWO), and by Ghent University through the Concerted  ... 
doi:10.1016/j.advengsoft.2019.02.002 fatcat:k4k7oaexsnfaxdb76kcl53wf5e
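
A hedged C++/CUDA analog of that programming model: the location of the data is carried in the array's type, and overload resolution (a static stand-in for Julia's multiple dispatch) selects the CPU or GPU code path. The types and names here are illustrative, not the paper's API.

    struct HostArray   { float* data; int n; };  // data in CPU memory
    struct DeviceArray { float* data; int n; };  // data in GPU memory

    __global__ void scale_kernel(float* d, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= s;
    }

    // The same generic operation; the data's type picks the implementation.
    void scale(HostArray a, float s) {
        for (int i = 0; i < a.n; ++i) a.data[i] *= s;   // plain CPU loop
    }

    void scale(DeviceArray a, float s) {
        scale_kernel<<<(a.n + 255) / 256, 256>>>(a.data, a.n, s);
    }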

Marian: Fast Neural Machine Translation in C++

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch
2018 Proceedings of ACL 2018, System Demonstrations  
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs.  ...  We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.  ...  This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117  ... 
doi:10.18653/v1/p18-4020 dblp:conf/acl/Junczys-Dowmunt18 fatcat:vl5hb5oitrcb5nv25w54qaqnqm
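
A dynamic computation graph of the kind the abstract describes can be sketched in a few lines of host-side C++; this is illustrative and in the spirit of, not taken from, Marian's expression-graph API.

    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Node {
        float val = 0.0f, grad = 0.0f;
        std::function<void()> backward;   // pushes this node's grad to parents
    };

    // Building an op records it on the tape; the graph exists only as a
    // trace of the code that actually ran (a "dynamic" graph).
    Node* mul(std::vector<Node*>& tape, Node* a, Node* b) {
        Node* n = new Node{a->val * b->val};   // leaks in this sketch
        n->backward = [=] {
            a->grad += b->val * n->grad;
            b->grad += a->val * n->grad;
        };
        tape.push_back(n);
        return n;
    }

    int main() {
        std::vector<Node*> tape;
        Node x{3.0f}, y{4.0f};
        Node* z = mul(tape, &x, &y);
        z->grad = 1.0f;                        // seed dz/dz = 1
        for (auto it = tape.rbegin(); it != tape.rend(); ++it)
            (*it)->backward();                 // reverse-mode sweep
        std::printf("dz/dx = %g, dz/dy = %g\n", x.grad, y.grad);  // 4 and 3
    }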

Marian: Fast Neural Machine Translation in C++

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, Andre F. T. Martins, Alexandra Birch
2018 Zenodo  
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs.  ...  We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.  ...  This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117  ... 
doi:10.5281/zenodo.2551642 fatcat:52weuuur5fb73nvcybgog6a7ei

Jittor: a novel deep learning framework with meta-operators and unified graph execution

Shi-Min Hu, Dun Liang, Guo-Ye Yang, Guo-Wei Yang, Wen-Yang Zhou
2020 Science China Information Sciences  
This approach is as easy to use as dynamic graph execution yet has the efficiency of static graph execution.  ...  Jittor provides classes of NumPy-like operators, which we call meta-operators. A deep learning model built upon these meta-operators is compiled into high-performance CPU or GPU code in real time.  ...  Acknowledgements: This work was supported by National Natural Science Foundation of China (Grant No. 61521002).  ... 
doi:10.1007/s11432-020-3097-4 fatcat:t4ruztzhgbap5gvm4cliago5de
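
The payoff of meta-operators is fusion: several small operators are composed and compiled into one kernel at run time. Below is a CUDA sketch of the kind of fused elementwise kernel such a compiler can emit; the particular chain (multiply, add, ReLU) is an assumed example, not Jittor's actual generated code.

    // Three logical operators fused into one kernel: the intermediates stay
    // in registers, so the data makes one trip through memory, not three.
    __global__ void fused_mul_add_relu(const float* a, const float* b,
                                       const float* c, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = a[i] * b[i] + c[i];   // mul + add in registers
            out[i] = v > 0.0f ? v : 0.0f;   // relu
        }
    }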

Lazy release consistency for GPUs

Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, David A. Wood
2016 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)  
For example, we found that the previously proposed RSP implementation actually results in slowdowns of up to 30% on large GPUs, compared to a naïve baseline system that forgoes work stealing and scopes  ...  AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used by permission by Khronos.  ... 
doi:10.1109/micro.2016.7783729 dblp:conf/micro/AlsopOBW16 fatcat:p5u2mv5gyzbcfpu7gjoxop4rse
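
The pattern whose cost this line of work targets is the release/acquire synchronization point. A CUDA sketch of that pattern follows (illustrative only; the paper's contribution is the coherence protocol underneath such code, and the two kernels would run concurrently on separate streams).

    __device__ int   flag = 0;        // synchronization variable
    __device__ float shared_val = 0.0f;

    __global__ void producer() {
        shared_val = 42.0f;
        __threadfence();              // release: publish data before the flag
        atomicExch(&flag, 1);
    }

    __global__ void consumer(float* out) {
        while (atomicAdd(&flag, 0) == 0) { }  // acquire: wait for the flag
        __threadfence();              // order the flag read before the data load
        *out = shared_val;
    }
    // Under lazy release consistency, coherence work is deferred to these
    // release/acquire points instead of being paid on every memory access.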

Marian: Fast Neural Machine Translation in C++ [article]

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch
2018 arXiv   pre-print
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs.  ...  We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.  ...  This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117  ... 
arXiv:1804.00344v3 fatcat:ieankv2jh5asbpmnzb5lgkwioq

Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs [article]

Keno Fischer, Elliot Saba
2018 arXiv   pre-print
They have powered many of Google's milestone machine learning achievements in recent years.  ...  Our method composes well with existing compiler-based automatic differentiation techniques on Julia code, and we are thus able to also automatically obtain the VGG19 backwards pass and similarly offload  ...  dynamic compiler framework (Revels & Contributors, 2018).  ... 
arXiv:1810.09868v1 fatcat:caiczrksvfhjhb7w666mps6mou