
Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping

Nathan Clark, Amir Hormati, Sami Yehia, Scott Mahlke, Krisztian Flautner
2007 2007 IEEE 13th International Symposium on High Performance Computer Architecture  
SIMD instructions are expressed using a processor's baseline scalar instruction set, and light-weight dynamic translation maps the representation onto a broad family of SIMD accelerators.  ...  Additionally, we show that the hardware overhead of dynamic optimization is modest, hardware changes do not affect cycle time of the processor, and the performance impact of abstracting the SIMD accelerator  ...  Expressing SIMD instructions using the baseline instruction set provides an abstract software interface for the SIMD accelerators, which can be utilized through a lightweight dynamic translator.  ... 
doi:10.1109/hpca.2007.346199 dblp:conf/hpca/ClarkHYMF07 fatcat:k7a7acld7fczno64yy6yqunmyi

Vapor SIMD: Auto-vectorize once, run everywhere

Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams, David Yuste, Albert Cohen, Ayal Zaks
2011 International Symposium on Code Generation and Optimization (CGO 2011)  
Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints.  ...  The scheme is composed of an aggressive, generic offline stage coupled with a lightweight, target-specific online stage.  ...  MOTIVATION Generating code for SIMD hardware has traditionally relied on target-specific manual optimization, using a plethora of intrinsic functions or aggressive offline compiler optimizations.  ... 
doi:10.1109/cgo.2011.5764683 dblp:conf/cgo/NuzmanDRRWYCZ11 fatcat:pawpjpuurzgyrpdwapv2qpl3uq

targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance

Alan Gray, Kevin Stratford
2014 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)  
To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware.  ...  Here we present targetDP (target Data Parallel), a lightweight programming layer that allows the abstraction of data parallelism for applications that employ structured grids.  ...  The new abstraction promotes optimal mapping of code to hardware thread-level parallelism (TLP) and instruction-level parallelism (ILP), via the partitioning of lattice-based parallelism and translation  ... 
doi:10.1109/hpcc.2014.212 dblp:conf/hpcc/GrayS14 fatcat:mq56eib7jjbb5devymdrovqn3i

Runtime Vectorization Transformations of Binary Code

Nabil Hallou, Erven Rohou, Philippe Clauss
2016 International journal of parallel programming  
For this purpose, we use open source frameworks that we have tuned and integrated to (1) dynamically lift the x86 binary into the Intermediate Representation form of the LLVM compiler, (2) abstract hot  ...  In fact, backward compatibility of ISA guarantees only the functionality, not the best exploitation of the hardware. In this work, we focus on maximizing the CPU efficiency for the SIMD extensions.  ...  As we operate in a dynamic environment, we are constrained to lightweight manipulations.  ... 
doi:10.1007/s10766-016-0480-z fatcat:2a3xnzyxdbfmnlxmjlcjalkax4

A Lightweight Approach to Performance Portability with targetDP [article]

Alan Gray, Kevin Stratford
2016 arXiv   pre-print
In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner.  ...  Leading HPC systems achieve their status through use of highly parallel devices such as NVIDIA GPUs or Intel Xeon Phi many-core CPUs.  ...  This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S.  ... 
arXiv:1609.01479v2 fatcat:n2a5rzfgqfdbdn3fui6srewuva

A lightweight approach to performance portability with targetDP

Alan Gray, Kevin Stratford
2016 The international journal of high performance computing applications  
In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner.  ...  Leading HPC systems achieve their status through use of highly parallel devices such as NVIDIA GPUs or Intel Xeon Phi many-core CPUs.  ...  This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S.  ... 
doi:10.1177/1094342016682071 fatcat:7qo4qg4ernhkbczsk2wpozdfae

Parallel, distributed and GPU computing technologies in single-particle electron microscopy

Martin Schmeisser, Burkhard C. Heisen, Mario Luettich, Boris Busche, Florian Hauer, Tobias Koske, Karl-Heinz Knauber, Holger Stark
2009 Acta Crystallographica Section D: Biological Crystallography  
Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations.  ...  tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware  ...  This logical hierarchy is mapped to hardware design. Thread local memory is implemented in registers residing within the multiprocessors, which are mapped to individual SPs (not shown).  ... 
doi:10.1107/s0907444909011433 pmid:19564686 pmcid:PMC2703572 fatcat:puw6niofz5au3fzybp7fn7tjxa

Artificial neural networks in hardware: A survey of two decades of progress

Janardan Misra, Indranil Saha
2010 Neurocomputing  
(HNN), appearing in academic studies as prototypes as well as in commercial use.  ...  We outline underlying design approaches for mapping an ANN model onto a compact, reliable, and energy efficient hardware entailing computation and communication and survey a wide range of illustrative  ...  These applications demand dealing with large amounts of real-time multimedia data from interacting environment, using lightweight hardware with strict power constraints, without letting the computational  ... 
doi:10.1016/j.neucom.2010.03.021 fatcat:regzu6sshvekzd5wxcuaiytgqu

waLBerla: A block-structured high-performance framework for multiphysics simulations [article]

Martin Bauer, Sebastian Eibl, Christian Godenschwager, Nils Kohl, Michael Kuron, Christoph Rettinger, Florian Schornbaum, Christoph Schwarzmeier, Dominik Thönnes, Harald Köstler, Ulrich Rüde
2019 arXiv   pre-print
Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system.  ...  The framework uses meta-programming techniques to generate highly efficient code for CPUs and GPUs from a symbolic method formulation.  ...  A thin abstraction layer on top of SIMD intrinsics for QPX, SSE, AVX and AVX2 allows the user to highly optimize a compute kernel with very close control over the hardware.  ... 
arXiv:1909.13772v1 fatcat:b2iwdbbugjeebk3diazleysedi

The MANGO FET-HPC Project: An Overview

Jose Flich, Giovanni Agosta, Philipp Ampletzer, David Atienza Alonso, Alessandro Cilardo, William Fornaciari, Mario Kovac, Fabrice Roudet, Davide Zoni
2015 2015 IEEE 18th International Conference on Computational Science and Engineering  
In this paper, we provide an overview of the MANGO project and its goal.  ...  HNs will contain a multi-chip mesh of power-efficient RISC cores augmented with custom vector resources (SIMD and lightweight GPU-like cores) as well as a dedicated memory architecture and a custom Network-on-Chip  ...  Fig. 1: MANGO Hardware Architecture. Fig. 2: MANGO Software Stack. Fig. 3: Mapping Applications on the MANGO Platform.  ... 
doi:10.1109/cse.2015.57 dblp:conf/cse/FlichAAACFKRZ15 fatcat:e6nezy5itrb33euoosuzgdw5qi

Designing application specific circuits with concurrent C# programs

David Greaves, Satnam Singh
2010 Eighth ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2010)  
Also, the compiled bytecode can be automatically converted into circuits using our Kiwi hardware synthesis system.  ...  between hardware and software implementations compared to multi-model approaches.  ...  Some systems do use very lightweight concurrency mechanism to facilitate circuit modeling e.g. on Windows there is an implementation of SystemC that uses very lightweight fibers which are user scheduled  ... 
doi:10.1109/memcod.2010.5558627 dblp:conf/memocode/GreavesS10 fatcat:ui65qq42trdi5o2focjpcqqdy4

Early performance data on the Blue Matter molecular simulation framework

R. S. Germain, Y. Zhestkov, M. Eleftheriou, A. Rayshubskiy, F. Suits, T. J. C. Ward, B. G. Fitch
2005 IBM Journal of Research and Development  
We describe the parallel decomposition currently being used to target the Blue Gene/L machine and discuss the application-based trace tools used to analyze the performance of the application.  ...  comparison of the performance of the Ewald and the particle-particle particle-mesh (P3ME) methods, compare the measured performance of some key collective operations with the limitations imposed by the hardware  ...  We also acknowledge the contributions of the Blue Gene/L hardware and system software teams whose efforts and assistance made it possible for us to use the Blue Gene/L prototype hardware.  ... 
doi:10.1147/rd.492.0447 fatcat:bxvwkxajx5btxfko6vbhrx55oi

Coherent global market simulations for counterparty credit risk

Claudio Albanese
2010 2010 Workshop on High Performance Computational Finance at SC10 (WHPCF)  
The network bottleneck is bypassed by using heterogeneous boards with acceleration.  ...  It should also be used consistently both to simulate and to value all instruments. This article describes the Mathematics and the software architecture of a risk system that accomplishes this task.  ...  (e) MIMD and SIMD designs are characterized by radically different threading models: SSE2/SSE3/AVX primitives rule with CPUs while the lightweight, no-frills threading models in CUDA/OpenCL are used for  ... 
doi:10.1109/whpcf.2010.5671842 fatcat:lpkvckbwbrcwdebsm7v5owrgxm

MLIR: A Compiler Infrastructure for the End of Moore's Law [article]

Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, Oleksandr Zinenko
2020 arXiv   pre-print
MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments  ...  point in design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers-describing diverse use-cases  ...  Instead, the system should maintain structure of computation and progressively lower to the hardware abstraction.  ... 
arXiv:2002.11054v2 fatcat:vuudltlsljbudbj7r6cc2kf754

Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary [chapter]

Shan Shan Huang, Amir Hormati, David F. Bacon, Rodric Rabbah
Lecture Notes in Computer Science  
While multicores are here today, the future is likely to witness architectures that use reconfigurable fabrics (FPGAs) as coprocessors.  ...  This paper shows how to bridge the gap between programming software vs. hardware.  ...  Dynamic dispatch in hardware. One of the defining features of object-oriented paradigms is the dynamic dispatch of methods.  ... 
doi:10.1007/978-3-540-70592-5_5 fatcat:e2kqrul2frfxrc7j4okdhr6vfi