
Efficient interpreter optimizations for the JVM

Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz
2013 Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform Virtual Machines, Languages, and Tools - PPPJ '13  
Furthermore, the performance attained through our optimizations is comparable with custom compiler performance. We provide an easily accessible annotation-based interface to enable our optimizations.  ...  We present two optimizations targeting these bottlenecks and show that the performance of the optimized interpreters increases dramatically: we report speedups by a factor of up to 2.45 over the Jython  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency  ... 
doi:10.1145/2500828.2500839 dblp:conf/pppj/Savrun-YeniceriZZLBLF13 fatcat:w5jxt3nvnbgwtcegvqn2zb5q4a

A multi-objective auto-tuning framework for parallel codes

Herbert Jordan, Peter Thoman, Juan J. Durillo, Simone Pellegrini, Philipp Gschwandtner, Thomas Fahringer, Hans Moritsch
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
Focusing on individual code regions, our compiler uses a novel search technique to compute a set of optimal solutions, which are encoded into a multi-versioned executable.  ...  Additionally, we show that parallelism-aware multi-versioning approaches like our own gain a performance improvement of up to 70% over solutions tuned for only one specific number of threads.  ...  For the optimizer, a generic interface has been defined, including an abstract method to evaluate sets of configurations.  ... 
doi:10.1109/sc.2012.7 dblp:conf/sc/JordanTBPGFM12 fatcat:7gq43aun75dcpnpvady63cmmfy

CUDA: Compiling and optimizing for a GPU platform

Gautam Chakrabarti, Vinod Grover, Bastiaan Aarts, Xiangyun Kong, Manjunath Kudlur, Yuan Lin, Jaydeep Marathe, Mike Murphy, Jian-Zhong Wang
2012 Procedia Computer Science  
Current GPUs can run tens of thousands of hardware threads and have been optimized for graphics workloads.  ...  We evaluate these techniques, and present performance results that show significant improvements on hundreds of kernels as well as applications.  ...  The PTX code is compiled by a device specific optimizing code generator called PTXAS. The compiled host code is combined with the device code to create an executable application.  ... 
doi:10.1016/j.procs.2012.04.209 fatcat:2agka5x4y5aaxggejik6hotsne

Evaluating end-to-end optimization for data analytics applications in weld

Shoumik Palkar, Saman Amarasinghe, Samuel Madden, Matei Zaharia, James Thomas, Deepak Narayanan, Pratiksha Thaker, Rahul Palamuttam, Parimajan Negi, Anil Shanbhag, Malte Schwarzkopf, Holger Pirk
2018 Proceedings of the VLDB Endowment  
Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications.  ...  In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications, and evaluating its impact on realistic data science workloads.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.  ... 
doi:10.14778/3213880.3213890 fatcat:oesslpgfy5awlb32xnylmjlnoa

Rethinking the parallelization of random-restart hill climbing: a case study in optimizing a 2-opt TSP solver for GPU execution

Molly A. O'Neil, Martin Burtscher
2015 Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015  
We present and evaluate an implementation of random-restart hill climbing with 2-opt local search applied to TSP. Our implementation is capable of addressing large problem sizes at high throughput.  ...  Our code outperforms the existing implementations by up to 3X, evaluating up to 60 billion 2-opt moves per second on a single K40 GPU.  ...  Our code incorporates several optimizations from prior works. Listing 2 below illustrates the impact of these optimizations on the pseudo code for the nested move evaluation loops.  ... 
doi:10.1145/2716282.2716287 dblp:conf/ppopp/ONeilB15 fatcat:zjcvaursqfbtvfilkqpu6v4poa

Exploiting SIMD for complex numerical predicates

Dongxiao Song, Shimin Chen
2016 2016 IEEE 32nd International Conference on Data Engineering Workshops (ICDEW)  
Then, we investigate cost models for both single-threaded and multi-threaded evaluation of filtering predicates.  ...  We find that the diversity of the predicates and the introduction of multiple threads pose significant challenges in modeling and optimizing complex predicates.  ...  The code is unrolled into sequential code to enable better optimizations by C/C++ compilers. C. Multi-threaded Implementation We support multi-threaded evaluation of conjunctive predicates.  ... 
doi:10.1109/icdew.2016.7495635 dblp:conf/icde/SongC16 fatcat:65ysy3cvfza7xime44hhdfvu7m

A MDE-Based Optimisation Process for Real-Time Systems

Olivier Gilles, Jérôme Hugues
2010 2010 13th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing  
We first define a generic evaluation pipeline, define a library of elementary transformations, and then show how to use it through a Domain-Specific Language to evaluate and then transform models.  ...  We illustrate this process on an AADL case study modeling a Generic Avionics Platform.  ...  Thus, we define the REAL theorem 3, which computes the distance to deadline of an optimized thread.  ... 
doi:10.1109/isorc.2010.38 dblp:conf/isorc/GillesH10 fatcat:xrsgqi6m55cozan6tikjhkf7cy

Characterizing Performance And Cache Impacts Of Code Multi-Versioning On Multicore Architectures

Peter Zangerl, Peter Thoman, Thomas Fahringer
2017 Zenodo  
Code multi-versioning is an increasingly widely adopted tool for implementing optimizations which respond to unknown or dynamically changing runtime conditions, without the performance overhead of just-in-time  ...  Despite this ongoing interest, there has been no comprehensive study of the impact of multi-versioning so far – particularly in a multi-threaded setting.  ...  ACKNOWLEDGEMENT This project has received funding from the European Union's Horizon 2020 research and innovation programme as part of the FETHPC AllScale project under grant agreement No 671603.  ... 
doi:10.5281/zenodo.375519 fatcat:iwq4kbhz3jccfjpnfet6z33dyq

A script-based autotuning compiler system to generate high-performance CUDA code

Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, Jacqueline Chame
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
This article introduces a Transformation Strategy Generator, a meta-optimizer that generates a set of transformation recipes, which are descriptions of the mapping of the sequential code to parallel CUDA  ...  This article presents a novel compiler framework for CUDA code generation.  ...  Using default optimization parameter values, we evaluate the set of variants.  ... 
doi:10.1145/2400682.2400690 fatcat:aqluktapgbhufnsy5l4qjp4bra

Stop and go: understanding yieldpoint behavior

Yi Lin, Kunshan Wang, Stephen M. Blackburn, Antony L. Hosking, Michael Norrish
2015 Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management - ISMM 2015  
In this paper we identify and evaluate yieldpoint design choices, including previously undocumented designs and optimizations.  ...  This analysis gives new insight into a critical but overlooked aspect of garbage collector implementation, and identifies a new optimization and new opportunities for very low overhead profiling.  ...  We conduct a preliminary evaluation of code patching as an optimization.  ... 
doi:10.1145/2754169.2754187 dblp:conf/iwmm/LinWBHN15 fatcat:wi5ery4yibfozj23gi3xifdplq

Implementation and Evaluation of OpenMP for Hitachi SR8000 [chapter]

Yasunori Nishitani, Kiyoshi Negishi, Hiroshi Ohta, Eiji Nunohiro
2000 Lecture Notes in Computer Science  
To create optimized code, the compiler can perform optimizations across the inside and outside of a PARALLEL region, or can produce code optimized for a fixed number of processors according to the compile  ...  This paper describes the implementation and evaluation of the OpenMP compiler designed for the Hitachi SR8000 Super Technical Server.  ...  By default, our compiler generates an object that can run with any number of threads, but if the -procnum=8 option is specified, the compiler generates code specially optimized for that number of threads  ... 
doi:10.1007/3-540-39999-2_38 fatcat:nribcsaemrbwbonyktedeawdl4

Reducing branch divergence in GPU programs

Tianyi David Han, Tarek S. Abdelrahman
2011 Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4  
We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application.  ...  Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths.  ...  We use a highly optimized version of this code as our base to evaluate the two optimizations we propose.  ... 
doi:10.1145/1964179.1964184 dblp:conf/asplos/HanA11 fatcat:2e7su3h76fg3hk6yucz2zijkj4

HiCrypt: C to CUDA Translator for Symmetric Block Ciphers

Keisuke Iwai, Naoki Nishikawa, Takakazu Kurokawa
2012 2012 Third International Conference on Networking and Computing  
To cope with this problem, we developed a new translator, HiCrypt, which can generate an optimized CUDA program from a cipher program written in standard C with directives.  ...  The generated programs achieve throughput almost identical to hand-optimized CUDA programs for all three cipher programs.  ...  IMPLEMENTATION AND EVALUATION A. Prototype of the translator A prototype of the HiCrypt translator was implemented to confirm its effectiveness through performance evaluation of the generated CUDA code.  ... 
doi:10.1109/icnc.2012.16 dblp:conf/ic-nc/IwaiNK12 fatcat:nwmvssuu7bg4pbhdw3gjpws73u

Piecewise holistic autotuning of parallel programs with CERE

Mihail Popov, Chadi Akel, Yohan Chatelain, William Jalby, Pablo de Oliveira Castro
2017 Concurrency and Computation  
Also, as regions of code do not benefit from the same parameters, an overall program evaluation (or monolithic evaluation) is not able to achieve the optimal per-region optimization.  ...  Instead of evaluating an optimization across all the codes, CERE evaluates the optimization once with a selected codelet. Then, it extrapolates its impact over the other similar codelets.  ... 
doi:10.1002/cpe.4190 fatcat:26nlplvflzf23apkevha6h65m4

Towards Automatic Parallelization of Stream Processing Applications

Manuel F. Dolz, David Del Rio Astorga, Javier Fernandez, J. Daniel Garcia, Jesus Carretero
2018 IEEE Access  
The evaluation, using a synthetic video benchmark and a real-world computer vision application, demonstrates that the presented framework is capable of producing parallel and optimized versions of the  ...  Parallelizing and optimizing codes for recent multi-/many-core processors has been recognized to be a complex task.  ...  Finally, Listing 2c shows an optimized version of the Video-App parallel code according to the arrangement proposed by PiBa using the profile execution data. 1) PERFORMANCE EVALUATION Given the foregoing  ... 
doi:10.1109/access.2018.2855064 fatcat:3f6ovqtbkvdjdgf5zb5a6dlzt4