
Advanced Stencil-Code Engineering (Dagstuhl Seminar 15161)

Christian Lengauer, Matthias Bolten, Robert D. Falgout, Olaf Schenk, Marc Herbstritt
2015 Dagstuhl Reports  
This report documents the program and the outcomes of Dagstuhl Seminar 15161 "Advanced Stencil-Code Engineering".  ...  It brought together experts from mathematics, computer science and applications to explore the challenges of very high performance and massive parallelism in solving partial differential equations.  ...  Autotuning divide-and-conquer stencil computations Ekanathan Palamadai Natarajan  ...
doi:10.4230/dagrep.5.4.56 dblp:journals/dagstuhl-reports/LengauerBFS15 fatcat:suk5zayvlnb63ozlamrw2vk2w4

A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops

Chi-Keung Luk, Ryan Newton, William Hasenplaugh, Mark Hampton, Geoff Lowney
2011 IEEE Software  
We have implemented our approach with the Intel® Compiler and the newly developed Intel® Software Autotuning Tool.  ...  In the era of multicores, many applications that tend to require substantial compute power and data crunching (aka Throughput Computing Applications) can now be run on desktop PCs.  ...  Therefore, the divide-and-conquer paradigm fits this architecture very well.  ...
doi:10.1109/ms.2011.2 fatcat:3ysms4aeebarpfhdgbzprloyxi

The pochoir stencil compiler

Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, Charles E. Leiserson
2011 Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures - SPAA '11  
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors.  ...  The parameters ta and tb are the beginning and ending time steps, and xa, xb, ya, and yb are the coordinates defining the region of the array u on which to perform the stencil computation.  ...  Thanks to Kaushik Datta of Reservoir Labs and Sam Williams of Lawrence Berkeley National Laboratory for providing us with the Berkeley autotuner code and help with running it.  ...
doi:10.1145/1989493.1989508 dblp:conf/spaa/TangCKLL11 fatcat:ly2k5ojxfvdxbg2azjkla44ykm
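The stencil abstraction this snippet describes, updating each grid point from itself and its near neighbors for time steps ta..tb on the region bounded by xa, xb, ya, yb, can be sketched in a few lines. This is a plain NumPy illustration with an assumed heat-equation update rule, not Pochoir's actual C++ interface:

```python
import numpy as np

def stencil_2d(u, ta, tb, xa, xb, ya, yb, alpha=0.1):
    """Repeatedly update each point of the region [xa, xb) x [ya, yb)
    of grid u as a function of itself and its four nearest neighbors
    (an explicit heat-equation step). The region must be interior,
    i.e. xa >= 1 and xb <= u.shape[0] - 1 (likewise for ya, yb)."""
    for _ in range(ta, tb):
        v = u.copy()
        v[xa:xb, ya:yb] = u[xa:xb, ya:yb] + alpha * (
            u[xa - 1:xb - 1, ya:yb] + u[xa + 1:xb + 1, ya:yb]
            + u[xa:xb, ya - 1:yb - 1] + u[xa:xb, ya + 1:yb + 1]
            - 4 * u[xa:xb, ya:yb]
        )
        u = v
    return u
```

A uniform grid is a fixed point of this update (the neighbor sum cancels the center term), which makes a handy sanity check.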

Auto-Tuning Parallel Skeletons

Alexander Collins, Christian Fensch, Hugh Leather
2012 Parallel Processing Letters  
The results show that the space is program and platform dependent, non-linear, and that automatic search achieves a significant average speedup in program execution time of 1.6× over a human expert.  ...  We performed this using a Monte Carlo search of a random subset of the space, for a representative set of platforms and programs.  ...  Acknowledgements We would like to thank Marco Aldinucci for sharing his expertise on the implementation of FastFlow and Christopher Thompson for the use of his 6-core machine.  ... 
doi:10.1142/s0129626412400051 fatcat:k6gg3afw2jcapkynbtwclyofhe
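The Monte Carlo search the abstract describes, drawing random configurations from the tuning space and keeping the best-timed one, can be sketched as follows. The parameter space and cost function here are made-up stand-ins for measured execution time:

```python
import random

def monte_carlo_tune(cost, space, samples=500, seed=0):
    """Randomly sample configurations from the tuning space and keep
    the one with the lowest measured cost (e.g., execution time)."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(samples):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost

# Hypothetical tuning space: worker count and chunk size, with a
# synthetic cost that is minimized at workers=4, chunk=64.
space = {"workers": [1, 2, 4, 8], "chunk": [16, 64, 256]}
cost = lambda cfg: abs(cfg["workers"] - 4) + abs(cfg["chunk"] - 64) / 64
best, _ = monte_carlo_tune(cost, space)
```

In a real autotuner, `cost` would time the program on the target platform, which is what makes the space program- and platform-dependent as the paper observes.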

An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation

Yukinori Sato, Tomoya Yuki, Toshio Endo
2019 ACM Transactions on Architecture and Code Optimization (TACO)  
From an evaluation on many-core CPUs, we demonstrate that our autotuner achieves a performance superior to those that use conventional static approaches and well-known autotuning heuristics.  ...  In this article, we focus on loop tiling, which plays an important role in performance tuning, and develop a novel framework that analytically models the load balance and empirically autotunes unpredictable  ...  They compared cache-oblivious methods that recursively split into smaller tiles through a divide-and-conquer strategy with typical single-level iteration space tiling.  ... 
doi:10.1145/3293449 fatcat:be4kecwhw5fnbi5krgabv7jqba
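Loop tiling, the transformation this framework autotunes, restructures a loop nest into cache-sized blocks; the tile size is exactly the kind of parameter such a tuner searches over. A toy sketch with illustrative names, not the framework's interface:

```python
def tiled_sum(a, n, tile):
    """Traverse an n x n matrix (stored as a flat list) in tile x tile
    blocks. Each block stays cache-resident while it is processed;
    the result is independent of the tile size chosen."""
    total = 0
    for ii in range(0, n, tile):            # block row
        for jj in range(0, n, tile):        # block column
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[i * n + j]
    return total
```

Because every (i, j) pair is visited exactly once, any tile size gives the same answer; only the memory-access order (and hence performance) changes.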

Energy-Efficient Computing for Extreme-Scale Science

David Donofrio, Leonid Oliker, John Shalf, Michael F. Wehner, Chris Rowen, Jens Krueger, Shoaib Kamil, Marghoob Mohiyuddin
2009 Computer  
To that end, we have developed Green Flash, an application-driven design that combines a many-core processor with novel alternatives to cache coherence and autotuning to improve the kernels' computational  ...  The computing industry has arrived at a rare inflection point: Fundamental principles of computer architecture are open to question, and new ideas are being explored.  ...  Acknowledgments We thank Mark Horowitz and the rest of the Smart Memories Team of Stanford University for early support and advice.  ... 
doi:10.1109/mc.2009.353 fatcat:qerwoknemnaivcn2j55l2oc7iu

Roofline: An Insightful Visual Performance Model for Multicore Architectures

Samuel Williams, Andrew Waterman, David Patterson
2009 Communications of the ACM  
We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.  ...  This research was sponsored in part by the Universal Parallel Computing Research Center, funded by Intel and Microsoft, and in part by the ASCR Office in the DOE Office of Science under contract number  ...  We'd like to thank FZ-Jülich and Georgia Tech for access to Cell blades.  ... 
doi:10.1145/1498765.1498785 fatcat:t4bx3edd5ba5hbfhd2xrpuo2si
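The visual performance model described here is the roofline bound: attainable floating-point throughput is capped by the lesser of peak compute rate and memory bandwidth times arithmetic intensity. A one-line sketch (the sample numbers are illustrative, not measured):

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# A kernel doing 0.25 flops/byte on a 100 GB/s, 50 GFLOP/s machine is
# bandwidth-bound at 25 GFLOP/s; at 2 flops/byte it hits the compute roof.
```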

Algebraic description and automatic generation of multigrid methods in SPIRAL

Matthias Bolten, Franz Franchetti, Paul H. J. Kelly, Christian Lengauer, Marcus Mohr
2017 Concurrency and Computation  
SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automatic generation of highly optimized target codes, customized for the specific execution platform at hand.  ...  The key to (2) is that it is written as a so-called breakdown rule using "→" instead of "=" and that the Kronecker product "⊗", see below, is used to construct the sparse matrices from smaller DFTs and  ...  This has been cooperative work that started at the Dagstuhl Seminar Advanced Stencil-Code Engineering [33] in April 2015.  ... 
doi:10.1002/cpe.4105 fatcat:yf7wzyevvvfflcln5tpoykwhtq
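The breakdown-rule style the abstract alludes to, building a large DFT from smaller ones via the Kronecker product "⊗", is the Cooley-Tukey factorization. A NumPy check of the smallest nontrivial case, DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L, where T is the diagonal twiddle-factor matrix and L the even/odd stride permutation:

```python
import numpy as np

def dft(n):
    """Dense DFT matrix: entry (j, k) = exp(-2*pi*i*j*k/n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

F2, I2 = dft(2), np.eye(2)
T = np.diag([1, 1, 1, np.exp(-2j * np.pi / 4)])  # twiddle factors
L = np.eye(4)[[0, 2, 1, 3]]                      # stride permutation L^4_2
F4 = np.kron(F2, I2) @ T @ np.kron(I2, F2) @ L   # breakdown rule for DFT_4
```

SPIRAL writes such rules with "→" rather than "=" because they describe a rewriting step that is applied recursively during code generation, not just a matrix identity.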

OPESCI-FD: Automatic Code Generation Package for Finite Difference Models [article]

Tianjiao Sun
2016 arXiv   pre-print
We implement the 3D velocity-stress FD scheme as an example and demonstrate the advantages of usability, flexibility and accuracy of the framework.  ...  The design of OPESCI-FD aims to allow rapid development, analysis and optimisation of Finite Difference programs.  ...  PATUS [12] generates C source code for stencil computation on shared-memory CPU architectures.  ... 
arXiv:1605.06381v1 fatcat:u5omtm2swja6dodkrzzxldy7si
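The finite-difference schemes OPESCI-FD generates approximate derivatives from neighboring grid values. As a minimal illustration (not the package's API), a second-order central difference applied to a sampled function:

```python
import numpy as np

def central_diff(f, h):
    """Second-order central difference: f'(x_i) ~ (f[i+1] - f[i-1]) / (2h),
    returned at the interior points only."""
    return (f[2:] - f[:-2]) / (2 * h)

x = np.linspace(0, np.pi, 1001)
d = central_diff(np.sin(x), x[1] - x[0])
# d approximates cos(x) at the interior points x[1:-1]
```

Real schemes like the 3D velocity-stress formulation mentioned above apply the same idea on staggered grids in several dimensions, which is why generating the code automatically pays off.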

Algorithmic species

Cedric Nugteren, Pieter Custers, Henk Corporaal
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
and complex memory partitioning.  ...  Similarly, parallelising compilers and source-to-source compilers can take threading and optimization decisions based on the same classification.  ...  This fits the "recursively partitioned" or "divide-and-conquer" skeleton. The stencil computation (listing 2) computes a result directly, making it fit the "task queue" or "farm" skeleton.  ... 
doi:10.1145/2400682.2400699 fatcat:gh6yxgfdufhh5lbfmrkuyzxkja

Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming

Tomasz Boiński, Paweł Czarnul
2021 Computer journal  
compute devices, including CPUs, GPUs and Intel Xeon Phis.  ...  The model considers an application that processes a large number of data chunks in parallel on various compute units and takes into account computations, communication including bandwidths and latencies  ...  processes with an average error of 5.4% and a maximum error of 23.3% and a divide-and-conquer application up to 1024 processes with an average error of 4.8% and a maximum error of 17.8% was presented.  ... 
doi:10.1093/comjnl/bxaa187 fatcat:m2wl52tugra2dfbspjareteuvu

A Refactoring Approach to Parallelism

Danny Dig
2011 IEEE Software  
This is tedious because it requires changing many lines of code, and it is error-prone and non-trivial because programmers need to ensure non-interference of parallel operations.  ...  Fortunately, refactoring tools can help reduce the analysis and transformation burden.  ...  ACKNOWLEDGMENT This work is partially funded by Intel and Microsoft through the UPCRC Center at Illinois.  ... 
doi:10.1109/ms.2011.1 fatcat:cuwlx6sopjgw5fvhxug6ixu2uq

An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

Andrew Davidson, Yao Zhang, John D. Owens
2011 2011 IEEE International Parallel & Distributed Processing Symposium  
The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an auto-tuning strategy to carefully select the switch points between computation  ...  We demonstrate that auto-tuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average respectively over static and dynamic tuning, and enables our  ...  Thanks also to the SciDAC Institute for Ultrascale Visualization, the HP Labs Innovation Research Program, and the National Science Foundation (Awards 0541448, 1017399, and 1032859) for funding, and to  ... 
doi:10.1109/ipdps.2011.92 dblp:conf/ipps/DavidsonZO11 fatcat:tpvrna62fvas5lpmbjdntqtj7u
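A multi-stage GPU method like the one described switches between parallel algorithms at tuned points; the classic serial baseline it is measured against is the O(n) Thomas algorithm. An illustrative reference implementation (not the paper's code):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system. a = sub-diagonal (a[0] unused),
    b = main diagonal, c = super-diagonal (c[-1] unused), d = right-hand
    side. Forward elimination then back substitution, O(n) work."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep carries a loop-carried dependence, which is precisely why GPU solvers resort to parallel reformulations and why choosing the switch points between them becomes an autotuning problem.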

Communication lower bounds and optimal algorithms for numerical linear algebra

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, O. Schwartz
2014 Acta Numerica  
Some of these generalize known lower bounds for dense classical (O(n³)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices  ...  Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.  ...  As discussed by Hoemmen (2010), successively scaling the basis vectors (e.g., dividing by their Euclidean norms) is not possible in the CA variants, as it reintroduces global communication between SpMV  ...
doi:10.1017/s0962492914000038 fatcat:43lzwu73vzbk3dvlq3zk5gydfy
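The classical result these bounds generalize, due to Hong and Kung (1981) and Irony, Toledo and Tiskin (2004), states that any schedule of classical n × n matrix multiplication on a machine with a fast memory of M words must transfer

```latex
W = \Omega\!\left( \frac{n^3}{\sqrt{M}} \right)
```

words between fast and slow memory; the paper extends bounds of this form to the rest of direct linear algebra, for both sequential and parallel machines.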

Integrating State of the Art Compute, Communication, and Autotuning Strategies to Multiply the Performance of the Application Programm CPMD for Ab Initio Molecular Dynamics Simulations [article]

Tobias Klöffel, Gerald Mathias, Bernd Meyer
2020 arXiv   pre-print
MPI+OpenMP parallelization now overlaps computation and communication.  ...  Following the internal instrumentation of CPMD, all time critical routines have been revised to maximize the computational throughput and to minimize the communication overhead for optimal performance.  ...  The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE).  ... 
arXiv:2003.08477v1 fatcat:arngdqddszfpvobj7gxjsmshde