PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih
2012 Parallel Computing  
This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique.  ...  The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.  ...  Acknowledgments We would like to thank Ian Cullinan, Tomasz Rybak, Chris Heuser, Romain Brette, and Dan Goodman who have graciously agreed to let us showcase their research in Section 6 of this article  ... 
doi:10.1016/j.parco.2011.09.001 fatcat:o7iwvib6mvawdjbb4kn6xarwce

Aspects for Stages: Cross Cutting Concerns for Metaprograms

Yannis Lilis, Anthony Savidis
2014 Journal of Object Technology  
In multi-stage languages the program code is finalized through a sequence of transformations defined in the program itself, a process known as staging, with stages also referred to as metaprograms.  ...  We discuss their implementation in a language supporting compile-time metaprogramming, where aspects are realized as batches of AST transformation metaprograms, accompanied by an AOP-specific library.  ...  As we mentioned earlier, in-staging aspects transform stage metaprograms before they are evaluated, so the two execute sequentially.  ... 
doi:10.5381/jot.2014.13.1.a1 fatcat:nxprwjcc55aijca4dww2csmupq

DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Francis P. Russell, Michael R. Mellor, Paul H.J. Kelly, Olav Beckmann
2011 Science of Computer Programming  
Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code.  ...  This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that  ...  Parallelisation This provides a number of interesting research topics. Loop fusion can inhibit parallelisation when sequential and parallel loops are fused.  ... 
doi:10.1016/j.scico.2008.06.002 fatcat:r6htqsbponagjgdfy4qpiar5su

High-Order Discontinuous Galerkin Methods by GPU Metaprogramming [chapter]

Andreas Klöckner, Timothy Warburton, Jan S. Hesthaven
2013 Lecture Notes in Earth System Sciences  
In a recent publication, we have shown that DG methods also adapt readily to execution on modern, massively parallel graphics processors (GPUs).  ...  In this article, we illuminate a few of the more practical aspects of bringing DG onto a GPU, including the use of a Python-based metaprogramming infrastructure that was created specifically to support  ...  While many rather complicated systems involving C++ metaprogramming strive to achieve this goal, they cannot match the simplicity (and performance) of textually pasting a chunk of purpose-specific C code  ... 
doi:10.1007/978-3-642-16405-7_23 fatcat:t5dciybhkndrtjy3n6gawygmxm

High-Order Discontinuous Galerkin Methods by GPU Metaprogramming [article]

Andreas Klöckner, Timothy Warburton, Jan S. Hesthaven
2012 arXiv   pre-print
In a recent publication, we have shown that DG methods also adapt readily to execution on modern, massively parallel graphics processors (GPUs).  ...  In this article, we illuminate a few of the more practical aspects of bringing DG onto a GPU, including the use of a Python-based metaprogramming infrastructure that was created specifically to support  ...  While many rather complicated systems involving C++ metaprogramming strive to achieve this goal, they cannot match the simplicity (and performance) of textually pasting a chunk of purpose-specific C code  ... 
arXiv:1211.0582v1 fatcat:4dafkf4uuba3zprh6zomndri7a

Embedding a Hardware Description Language in Template Haskell [chapter]

John T. O'Donnell
2004 Lecture Notes in Computer Science  
A new solution to these problems is based on program transformations performed automatically by metaprograms in Template Haskell.  ...  DSL code serve also as the executable Haskell code.  ...  Section 4.6 shows how metaprogramming solves this problem, enabling us to have software logic probes while retaining the correct semantics of the circuit.  ... 
doi:10.1007/978-3-540-25935-0_9 fatcat:sbbvahhhafg4xlzbtak6tbk3xy

Object-Oriented Service Clouds for Transdisciplinary Computing [chapter]

Michael Sobolewski
2012 Cloud Computing and Services Science  
The SORCER operating system (SOS) supports the two-way convergence of three programming models for transdisciplinary computing in service clouds.  ...  On one hand, EOP is uniformly converged with VOP and VOM to express an explicit network-centric service-oriented (SO) computation process in terms of other implicit (inter/intra) process expressions.  ...  and related resources, and enables the collaboration of the required service providers according to the metaprogram definition with its own control strategy.  ... 
doi:10.1007/978-1-4614-2326-3_1 fatcat:gwhvizlpwfdytdzarsfgddtina

A Local-View Array Library for Partitioned Global Address Space C++ Programs

Amir Kamil, Yili Zheng, Katherine Yelick
2014 Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY'14  
These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility  ...  Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library.  ...  Acknowledgment This work was supported in part by the Department of Energy's Office of Science, Advanced Scientific Computing Research X-Stack Program under Lawrence Berkeley National Laboratory (LBNL)  ... 
doi:10.1145/2627373.2627378 dblp:conf/pldi/KamilZY14 fatcat:u4aohl2kgncm7mxjbaqfhobxwa


Shigeo Itou, Satoshi Matsuoka, Hirokazu Hasegawa
2000 Proceedings of the ACM 2000 conference on Java Grande - JAVA '00  
Once AJaPACK is downloaded and executed, the Java version of ATLAS (ATLAS for Java) and the parallelized version of JLAPACK combine to achieve optimized pure Java execution for the given environment.  ...  Benchmarks have shown that AJaPACK achieves approximately 1/2 to 1/5 of the speed of optimized C-ATLAS and vendor supplied BLAS libraries, and with portable parallelization in SMP environments, achieves  ...  As mentioned above, these are shown not to have the best sequential numerical execution speed, especially in comparison to the C counterparts.  ... 
doi:10.1145/337449.337529 dblp:conf/java/ItouMH00 fatcat:hksm6d5wtfesvfzlrmzovrpi3e

Layering RTL, SAFL, Handel-C and Bluespec constructs on Chisel HCL

David J Greaves
2015 2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE)  
We include RTL, SAFL-style functional hardware description, Handel-C message passing and Bluespec rules.  ...  Chisel is a hardware construction language that supports a simplistic level of transactional programming via its Decoupled I/O primitives.  ...  These languages have staged execution (or embody metaprogramming) where part of the program is run at compile time and the remainder at run time.  ... 
doi:10.1109/memcod.2015.7340477 dblp:conf/memocode/Greaves15 fatcat:np7zgvfxejd6lolv4j326azrsu

ComposableThreads: Rethinking User-level Threads with Composability and Parametricity in C++

Wataru Endo, Shigeyuki Sato, Kenjiro Taura
2022 Journal of Information Processing  
However, most of them are typically built as huge sets of monolithic components which achieve customizability with additional costs via concrete C APIs.  ...  We have noticed that the zero-overhead abstraction of C++ is beneficial for assembling flexible user-level threading in a clearer manner.  ...  The parallel speedup of Argobots is also apparently limited compared to the other two systems.  ... 
doi:10.2197/ipsjjip.30.269 fatcat:lb2cwvwpabeulduyxwgiglqbbe

GenoMus: Representing Procedural Musical Structures with an Encoded Functional Grammar Optimized for Metaprogramming and Machine Learning

José López-Montes, Miguel Molina-Solana, Waldo Fajardo
2022 Applied Sciences  
This highly homogeneous and modular approach simplifies metaprogramming and maximizes search space.  ...  The core of GenoMus is a functional grammar designed to cover a wide range of styles, integrating traditional and contemporary composing techniques.  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/app12168322 fatcat:kgss5i52sncqln74ue6rioz2wu

Increasing Capacity In Telemedicine Using Flow-Based Programming.pdf

Jeremy Thornton
2016 Figshare  
Develop Communicating Sequential Process and Graph Theoretic mathematical semantics for Flow-Based Programming (FBP) in order to define 3 fundamental FBP digraph types, build matching FBP software artifacts  ...  and subject their inherently stochastic scaling to statistical analysis in order to categorize their scaling as either bound by Amdahl's Law or approaching the linear scaling of Gustafson-Barsis' Law  ...  At each step the CSP compliance design rules are elucidated and how they are satisfied using the template metaprogramming techniques that the C++ language enables.  ... 
doi:10.6084/m9.figshare.3580755.v1 fatcat:s5up62llevadfkstav45t7esge

Enhancements Of The Massively Parallel Memory Allocator Scatteralloc And Its Adaption To The General Interface Mallocmc [article]

Carlchristian Helmut Johannes Eckert, Wolfgang E Nagel
2014 Zenodo  
The implementation presented uses a policy-based design to abstract aspects of the ScatterAlloc algorithm and the default CUDA on-device memory allocator.  ...  Acknowledgments I would like to thank the computational radiation physics group of the Helmholtz Zentrum Dresden Rossendorf for constant support in questions about C++, CUDA and parallel debugging.  ...  Steinberger et al. from Graz University of Technology for providing their original implementation of ScatterAlloc.  ... 
doi:10.5281/zenodo.34461 fatcat:iqxlkfudpfefpac6d2ifn4ohtu

A Case Study of Performance Degradation Attributable to Run-Time Bounds Checks on C++ Vector Access

David Flater, William F. Guthrie
2013 Journal of Research of the National Institute of Standards and Technology  
The benchmark consisted of a loop that wrote to array elements in sequential order.  ...  With this configuration, relative to the best performance observed for any access method in C or C++, mean degradation of only (0.881 ± 0.009) % was measured for a standard bounds-checking access method  ...  Benchmarks and Timing The C/C++ code example in Fig. 1 shows how the CPU time used to execute a variety of methods of assigning to elements of an array or array-like data structure was measured.  ... 
doi:10.6028/jres.118.012 pmid:26401432 pmcid:PMC4487316 fatcat:75fd2ywddbemncnypdkynndrmi