Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis

Kazem Cheshmi, Shoaib Kamil, Michelle Mills Strout, Maryam Mehri Dehnavi
2017 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17)
Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes.  ...  As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8X and  ...  Figure 6: Sympiler's performance compared to Eigen for triangular solve. The stacked bars show the performance of the Sympiler (numeric) code with VS-Block and VI-Prune.  ... 
doi:10.1145/3126908.3126936 dblp:conf/sc/CheshmiKSD17 fatcat:joe4jxi2lraelbjwo65l3sarpa

Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging [article]

Zachary DeVito, Michael Mara, Michael Zollhöfer, Gilbert Bernstein, Jonathan Ragan-Kelley, Christian Theobalt, Pat Hanrahan, Matthew Fisher, Matthias Nießner
2017 arXiv   pre-print
The mathematical descriptions of these functions are extremely concise, but their implementation in real code is tedious, especially when optimized for real-time performance on modern GPUs in interactive  ...  Our compiler automatically transforms these specifications into state-of-the-art GPU solvers based on Gauss-Newton or Levenberg-Marquardt methods.  ...  ACKNOWLEDGMENTS This work was supported by the DOE Office of Science ASCR in the ExMatEx and ExaCT Exascale Co-Design Centers, program manager Karen Pao; DARPA Contract No.  ... 
arXiv:1604.06525v3 fatcat:cslx7yclvbaxhpo5ialdmfbekm

Automatic Differentiation: Obtaining Fast and Reliable Derivatives — Fast [chapter]

Christian H. Bischof, Alan Carle, Peyvand M. Khademi, Gordon Pusch
1995 Control Problems in Industry  
We highlight some applications of ADIFOR to large industrial and scientific codes, and discuss the effectiveness and performance of our approach.  ...  After a brief discussion of methods of differentiating codes, we review automatic differentiation and introduce the ADIFOR automatic differentiation tool.  ...  is sparse (e.g., the identity), then if one ignores exact numerical cancellation, the left-hand side vector 20 in (2) has no fewer nonzeros than any of the vectors on the right-hand side.  ... 
doi:10.1007/978-1-4612-2580-5_1 fatcat:poymjqvlgranblai6ugbtarfka

Manifold Geometry with Fast Automatic Derivatives and Coordinate Frame Semantics Checking in C++ [article]

Leonid Koppel, Steven L. Waslander
2018 arXiv   pre-print
Computer vision and robotics problems often require representation and estimation of poses on the SE(3) manifold.  ...  We contrast the library with existing open source packages and show that it can evaluate Jacobians in forward and reverse mode with little to no runtime overhead compared to hand-coded derivatives.  ...  Contemporary real-time implementations typically use hand-coded, analytically derived Jacobians and rely on external C++ libraries, examined in Section III, for numerical and optimization routines.  ... 
arXiv:1805.01810v1 fatcat:oqflbwokgrcpfmi6zaxjciilqi

Compiler Technology for Blue Gene Systems [chapter]

Stefan Kral, Markus Triska, Christoph W. Ueberhuber
2006 Lecture Notes in Computer Science  
Compiling FFTW code, MAP reaches as much as 80% of the optimum performance of Blue Gene systems. In an application code MAP enabled a sustained performance of 60 Tflop/s to be reached on BlueGene/L.  ...  To reach the leading position in the Top500 supercomputing list, IBM had to put considerable effort into coding and tuning a limited range of low-level numerical kernel routines by hand.  ...  We have examined the performance attributed to the compiler backend used (xlc mapvect vs. map vect), finding that the MAP backend produces much better code for compilation units consisting of one large  ... 
doi:10.1007/11823285_29 fatcat:3gu4ufy2q5aj7hm76fhuscm6vm

Sparsity-Specific Code Optimization using Expression Trees [article]

Philipp Herholz, Xuan Tang, Teseo Schneider, Shoaib Kamil, Daniele Panozzo, Olga Sorkine-Hornung
2021 arXiv   pre-print
We show that our approach scales to large problems and can achieve speedups of two orders of magnitude on CPUs and three orders of magnitude on GPUs, compared to a set of manually optimized CPU baselines  ...  We introduce a code generator that converts unoptimized C++ code operating on sparse data into vectorized and parallel CPU or GPU kernels.  ...  Optimizing code by hand can significantly improve performance. We automatically generate optimized code that is even more efficient. All experiments in this figure were conducted on Intel.  ... 
arXiv:2110.12865v1 fatcat:cmbpwhg3rfdnhf4ugnqxmd2b7q

Optimizing ART: Radiative Transfer Forward Modeling Code for Solar Observations with ALMA

Marcin Krotkiewski
2018 Zenodo  
Performance tests have shown that on the Broadwell architecture the optimized code works from 2.5x faster (RT solver) to 13x faster (EOS solver) on a single core.  ...  MPI implementation of the code scales with 95% efficiency on 2048 cores.  ...  Performance of the optimized code is discussed in Chapter 5 Performance.  ... 
doi:10.5281/zenodo.2633704 fatcat:awl6j2v5wrdola7qbrk25xhwzq

Automated and parallel code generation for finite-differencing stencils with arbitrary data types

K.A. Hawick, D.P. Playne
2010 Procedia Computer Science  
Achieving high performance on numerical solutions to PDEs generally requires exposure of the field data structures and application of knowledge of how best to map them to the memory and processing architecture  ...  We report on some performance evaluations for our generated PDE simulations on GPUs and other platforms.  ...  Example Performance Results To test the usefulness of the arbitrary data-type stencil manipulator in conjunction with the code-generator, we compare the performance of a generated simulation vs a hand-written  ... 
doi:10.1016/j.procs.2010.04.201 fatcat:dnzrj6cp6jbalbzglvavdnh5fa

OCCAL A Mixed Symbolic-Numeric Optimal Control CALculator [chapter]

Rainer Schöpf, Peter Deuflhard
1994 Computational Optimal Control  
In simpler problems, the present version of OCCAL automatically produces the full subroutine input for a MULtiple shooting code (MULCON) with adaptive numerical CONtinuation.  ...  Examples illustrate the performance of OCCAL/MULCON.  ...  The authors wish to thank M. Wulkow for several helpful discussions.  ... 
doi:10.1007/978-3-0348-8497-6_21 fatcat:qjcnpsac7jgsfjlgrwbi7caq64

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels [article]

Vitor Hugo Mickus Rodrigues, Lucas Cavalcante, Maelso Bruno Pereira, Fabio Luporini, István Reguly, Gerard Gorman, Samuel Xavier de Souza
2019 arXiv   pre-print
Our study reveals that improving memory usage should be the most efficient strategy for leveraging the performance of the implemented solution on the evaluated architectures.  ...  We embed it with the Oxford Parallel Domain Specific Language (OP-DSL) in order to enable automatic code generation for GPU architectures from a high-level representation.  ...  One can also see that aggressive optimization produces code with better performance than with basic optimization in all scenarios, enabling approximately 24% of peak performance to be achieved versus 6%  ... 
arXiv:1912.00695v1 fatcat:w2z25hm3ujchrd3ke5wcjkik4y

Pydgin: generating fast instruction set simulators from simple architecture descriptions with meta-tracing JIT compilers

Derek Lockhart, Berkin Ilbeyi, Christopher Batten
2015 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
comparable to hand-coded DBT-ISSs.  ...  Construction of frameworks capable of providing both the productivity benefits of ADL-generated simulators and the performance benefits of DBT remains a significant challenge.  ...  We would like to sincerely thank Carl Friedrich Bolz and Maciej Fijałkowski for their assistance in performance tuning Pydgin as well as their valuable feedback on the paper.  ... 
doi:10.1109/ispass.2015.7095811 dblp:conf/ispass/LockhartIB15 fatcat:2hseye22xrazxakmfteczsi3ri

Performance Portability and Unified Profiling for Finite Element Methods on Parallel Systems

Vladyslav Kucher, Jens Hunloh, Sergei Gorlatch
2020 Advances in Science, Technology and Engineering Systems  
However, today's application programming for parallel systems lacks performance portability: the same program code cannot achieve stable high performance on different parallel architectures.  ...  Systems composed of such processors enable high-performance execution of demanding applications like numerical Finite Element Methods.  ...  Acknowledgement The authors gratefully acknowledge generous support from the German Federal Ministry of Education and Research (BMBF) within the HPC²SE project.  ... 
doi:10.25046/aj050116 fatcat:lpl5a7hzvva6zglqvgedauorbu

Automatic Generation of Efficient Adjoint Code for a Parallel Navier-Stokes Solver [chapter]

Patrick Heimbach, Chris Hill, Ralf Giering
2002 Lecture Notes in Computer Science  
To achieve a tractable problem in both CPU and memory requirements, despite the control flow reversal, the adjoint code relies heavily on the balancing of storing vs. recomputation via the checkpointing  ...  The adjoint code of the parallel MIT general circulation model is generated using TAMC.  ...  Acknowledgments This paper is a contribution to the ECCO project, supported by NOPP, and with funding from NASA, NSF and ONR.  ... 
doi:10.1007/3-540-46080-2_107 fatcat:ycwrkpxvuva4tgamnyovtew2oq

Automatic segmentation of cell nuclei in bladder and skin tissue for karyometric analysis

Vrushali R Korde, Hubert Bartels, Jennifer Barton, James Ranger-Moore
2009 Analytical and quantitative cytology and histology  
The same procedure was performed on 10 skin histology images with a sensitivity of 83.0% and median proportional difference of 2.6%.  ...  This robust segmentation technique used properties of the image histogram to optimally select a threshold and create closed 4-way chain code nuclear segmentations.  ...  Acknowledgments Research was partially supported by a grant from the National Institutes of Health, P01 CA27502, and by the NIH Biomedical Imaging and Spectroscopy Training Grant at the University of Arizona  ... 
pmid:19402384 pmcid:PMC2810397 fatcat:nqrgt3p46zcsjavmqek57eiocu

Static Compilation Analysis for Host-Accelerator Communication Optimization [chapter]

Mehdi Amini, Fabien Coelho, François Irigoin, Ronan Keryell
2013 Lecture Notes in Computer Science  
Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible.  ...  We present experimental results obtained with the Polybench 2.0, some Rodinia benchmarks, and with a real numerical simulation.  ...  Acknowledgments We are grateful to Béatrice Creusillet, Pierre Jouvelot, and Eugene Ressler for their numerous comments and suggestions which helped us improve our presentation, to Dominique Aubert who  ... 
doi:10.1007/978-3-642-36036-7_16 fatcat:jpftk6kotjbgtnq2pux3ep7wly