2,564 Hits in 6.3 sec

First Application of Lattice QCD to Pezy-SC Processor

Tatsumi Aoyama, Ken-Ichi Ishikawa, Yasuyuki Kimura, Hideo Matsufuru, Atsushi Sato, Tomohiro Suzuki, Sunao Torii
2016 Procedia Computer Science  
On single and multiple Pezy-SC devices, the sustained performance is measured for the matrix multiplications and a BiCGStab solver. We examine how the data layout affects the performance.  ...  We offload an iterative solver of a linear equation for a fermion matrix, which is in general the most time consuming part of the lattice QCD simulations.  ...  This work is supported in part by JSPS Grant-in-Aid for Scientific Research (No. 25400284).  ... 
doi:10.1016/j.procs.2016.05.457 fatcat:plilfzg7g5bodfrfshiabcutli

Vectorization techniques for the Blue Gene/L double FPU

J. Lorenz, S. Kral, F. Franchetti, C. W. Ueberhuber
2005 IBM Journal of Research and Development  
This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of  ...  This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer.  ...  Johnson for years of pleasant and productive cooperation. Special thanks go to Manish Gupta, José Moreira, and their group at the  ... 
doi:10.1147/rd.492.0437 fatcat:vdkdszwotvc5fg6l6r2n5de4pu

Exploiting On-chip Memory Bandwidth in the VIRAM Compiler [chapter]

David Judd, Katherine Yelick, Christoforos Kozyrakis, David Martin, David Patterson
2001 Lecture Notes in Computer Science  
It combines vector processing with mixed logic and DRAM to achieve high performance with relatively low energy, area, and design complexity.  ...  The second problem is to support the kinds of narrow data types that arise in media processing, including processing of 8 and 16-bit data.  ...  Avanti Corp. provided the CAD tools used for the design.  ... 
doi:10.1007/3-540-44570-6_8 fatcat:yo6zqdknxfcabgvbf6ghsh5qye

Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes [article]

Jean-Matthieu Gallard, Leonhard Rannabauer, Anne Reinarz, Michael Bader
2020 arXiv   pre-print
With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities, by introducing a hybrid Array-of-Structure-of-Array data layout that solves the data layout conflict  ...  Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using  ...  We thank the Gauss Centre for Supercomputing e.V. (www. for providing computing time on the GCS supercomputers SuperMUC and SuperMUC-NG at Leibniz Supercomputing Centre (  ... 
arXiv:2003.12787v1 fatcat:p3fef27ugvbktoqeocpzil6yai

SIMD vectorization for the Lennard-Jones potential with AVX2 and AVX-512 instructions

Hiroshi Watanabe, Koh M. Nakagawa
2019 Computer Physics Communications  
Since the force-calculation kernel of the molecular dynamics method involves indirect access to memory, the data layout is one of the most important factors in vectorization.  ...  While the difference in performance between AoS and SoA is significant for the vectorization with AVX2, that with AVX-512 is minor.  ...  Noguchi for fruitful discussions.  ... 
doi:10.1016/j.cpc.2018.10.028 fatcat:2mupfu43ybcvnfvth556xwkd6y

Fast Matrix-Free Discontinuous Galerkin Kernels on Modern Computer Architectures [chapter]

Martin Kronbichler, Katharina Kormann, Igor Pasichnyk, Momme Allalen
2017 Lecture Notes in Computer Science  
State-of-the-art implementations of these kernels stress both arithmetics and memory transfer. The implementations of SIMD vectorization and shared-memory parallelization are detailed.  ...  a complex application code in fluid dynamics.  ...  The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V.  ... 
doi:10.1007/978-3-319-58667-0_13 fatcat:qbipylq37fhbjibqp46itbfida

Porcupine: A Synthesizing Compiler for Vectorized Homomorphic Encryption [article]

Meghan Cowan, Deeksha Dangwal, Armin Alaghi, Caroline Trippel, Vincent T. Lee, Brandon Reagen
2021 arXiv   pre-print
Quill captures the underlying HE operator behavior that enables Porcupine to reason about the complex trade-offs imposed by the challenges and generate optimized, verified HE kernels.  ...  Homomorphic encryption (HE) is a privacy-preserving technique that enables computation directly on encrypted data.  ...  Data Layout. A data layout defines how the inputs and outputs are packed into ciphertext and plaintext vectors. In the example, we pack the input and output image  ... 
arXiv:2101.07841v1 fatcat:lpi5byepkreqnfag4u2aod3y6e

On the Efficacy and High-Performance Implementation of Quaternion Matrix Multiplication [article]

David Williams-Young, Xiaosong Li
2019 arXiv   pre-print
In this pursuit, an optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix  ...  In this work, a case will be made for the efficacy of high-performance quaternion linear algebra software for appropriate problems.  ...  for aid in the tuning of HAXX.  ... 
arXiv:1903.05575v1 fatcat:2iqonbsxdngndoflqayw7zoovu

MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms [article]

Rafael Ravedutti Lucio Machado, Jan Eitzinger, Harald Köstler, Gerhard Wellein
2022 arXiv   pre-print
Proxy-apps, or mini-apps, are simple self-contained benchmark codes with performance-relevant kernels extracted from real applications.  ...  The MD-Bench source code is understandable, extensible and suited for teaching, benchmarking and researching MD algorithms.  ...  AoS data layout was used with double precision floating point arithmetic.  ... 
arXiv:2207.13094v1 fatcat:5f4z5afshzg3bpw7pgmwqwdn3e

Hardware/compiler codevelopment for an embedded media processor

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick
2001 Proceedings of the IEEE  
consumption, and reduced design complexity.  ...  A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power  ...  layout is preferred for them.  ... 
doi:10.1109/5.964446 fatcat:6cidyw7dfneorhf6wwaiu6lmyy

A Framework for Lattice QCD Calculations on GPUs

F.T. Winter, M.A. Clark, R.G. Edwards, B. Joo
2014 2014 IEEE 28th International Parallel and Distributed Processing Symposium  
The QCD Data-Parallel software layer provides data types and expressions with stencil-like operations suitable for lattice field theory and Chroma implements algorithms in terms of this high-level interface  ...  The QDP-JIT/PTX library, the reimplementation of the low-level layer, provides a framework for lattice QCD calculations for the CUDA architecture.  ...  Partial support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by U.S.  ... 
doi:10.1109/ipdps.2014.112 dblp:conf/ipps/WinterCEJ14 fatcat:suuszzh5k5gizmsyuh453gsaoi

Fast matrix-free evaluation of discontinuous Galerkin finite element operators [article]

Martin Kronbichler, Katharina Kormann
2017 arXiv   pre-print
The sum factorization kernels are optimized by vectorization over several cells and faces and an even-odd decomposition of the one-dimensional compute kernels.  ...  Different algorithms and data structures for the implementation of operator evaluation are compared in an in-depth performance analysis.  ...  Vectorization layout for face integrals In this subsection, we present data structures to organize vectorization over several faces for a finite element mesh beyond the compute kernels presented in Section  ... 
arXiv:1711.03590v1 fatcat:lmryj7tupjhqxdmevxxyhohv7y

A New Kind of Data Centric Performance Portability Challenge Item [article]

Tim Germann
For other data-oriented applications, portability is a greater challenge, and widely applicable abstractions not as simple.  ...  Data science applications, including machine learning, optimization, graph analytics, and other large-scale data-driven computations, present a unique set of challenges to performance portability.  ...  (not global matrix) ✔ Backends compete for best performance, latency vs throughput, optimize for order/device, use JIT, … backend kernels frontend apps libXSMM, AVX libCEED v0.7 ✔ Extensible  ... 
doi:10.6084/m9.figshare.14125964.v3 fatcat:ppdp4wlqarhpdkdux3dpnwksle

Performance comparison between Java and JNI for optimal implementation of computational micro-kernels [article]

Nassim A. Halli, Henri-Pierre Charles, Jean-François Mehaut
2014 arXiv   pre-print
In this paper we tackle this problem and we propose to do this analysis for a set of micro-kernels.  ...  We also investigate the impact on performance of several different optimization schemes which are vectorization, out-of-order optimization, data alignment, method inlining and the use of native memory  ...  Considering the Horner data-1st kernel, mixing vectorization and out-of-order allows performance to reach the CPU peak.  ... 
arXiv:1412.6765v1 fatcat:mvysul6sy5bcnabsfwh5usajla

General purpose lattice QCD code set Bridge++ 2.0 for high performance computing [article]

Yutaro Akahoshi, Sinya Aoki, Tatsumi Aoyama, Issaku Kanamori, Kazuyuki Kanaya, Hideo Matsufuru, Yusuke Namekawa, Hidekatsu Nemura, Yusuke Taniguchi
2021 arXiv   pre-print
The previous version of Bridge++ is implemented in double precision with a fixed data layout.  ...  Bridge++ is a general-purpose code set for a numerical simulation of lattice QCD aiming at a readable, extensible, and portable code while keeping practically high performance.  ...  This work is supported by JSPS KAKENHI (JP20K03961, JP21K03553), the MEXT as 'Program for Promoting Researches on the Supercomputer Fugaku' (Simulation for basic science: from fundamental laws of particles  ... 
arXiv:2111.04457v1 fatcat:cjwahovnz5ewlk55nuogc72l5e