Evaluation of the Stretch S6 Hybrid Reconfigurable Embedded CPU Architecture for Power-Efficient Scientific Computing

Thang Viet Huynh, Manfred Mücke, Wilfried N. Gansterer
2012 Procedia Computer Science  
We show how a Stretch S6 hybrid reconfigurable CPU (S6) can be extended to natively support double precision floating-point arithmetic.  ...  We evaluate if the superlinear performance improvement of floating-point multiplication on reconfigurable fabrics can be exploited in the framework of a hybrid reconfigurable CPU.  ...  [17] demonstrated implementation of a double precision floating-point FMA operation within the reconfigurable fabric of the commercially available Stretch S6 hybrid reconfigurable CPU.  ... 
doi:10.1016/j.procs.2012.04.021 fatcat:cn6cu3waabacpepdwvd6rb4ruu

High-Performance Computing Applications on Novel Architectures

Volodymyr Kindratenko, George K. Thiruvathukal, Steven Gottlieb
2008 Computing in science & engineering (Print)  
The IBM PowerXCell 8i-based Roadrunner supercomputer recently acquired by Los Alamos National Laboratory (LANL) is the first to achieve a petaflop per second on the Linpack benchmark; it's based on a new  ...  Guest Editors' Introduction. This is an exciting time for high-performance computing (HPC) on novel architectures.  ...  It natively supports double-precision floating-point arithmetic and is designed to execute application kernels that lend themselves to vector processing.  ... 
doi:10.1109/mcse.2008.149 fatcat:pdoldufmsfepheujc66z6hj4ma

Low Energy Consumption on Post-Moore Platforms for HPC Research

Pablo Josue Rojas Yepes
2021 Avances en Ciencias e Ingenierías  
a decade ago was only found on a Server.  ...  on a large scale, these devices are compared in different tests, presenting advantages such as its performance per watt consumed, smart form, among others.  ...  HPL (High-Performance Linpack) [10] solves a random dense linear system in double precision (64 bits) arithmetic on distributed-memory computers.  ... 
doi:10.18272/aci.v13i2.2108 fatcat:nnnar5opyfbobczuxgh2hqnppq

Solving global shallow water equations on heterogeneous supercomputers

Haohuan Fu, Lin Gan, Chao Yang, Wei Xue, Lanning Wang, Xinliang Wang, Xiaomeng Huang, Guangwen Yang, Juan A. Añel
2017 PLoS ONE  
With optimizations on both computing and memory access patterns, we manage to achieve around 8 to 20 times speedup when comparing one hybrid GPU or MIC node with one CPU node with 12 cores.  ...  On heterogeneous supercomputers, such as Tianhe-1A and Tianhe-2, our framework is capable of achieving ideally linear scaling efficiency, and sustained double-precision performances of 581 Tflops on Tianhe  ...  The double-precision FPGA design is even slower than a pure CPU version, due to the limit of the bandwidth among different FPGA cards.  ... 
doi:10.1371/journal.pone.0172583 pmid:28282428 pmcid:PMC5345762 fatcat:jvow4idrzndbffmdxu7q5k5jby

Online codesign on reconfigurable platform for parallel computing

Clément Foucher, Fabrice Muller, Alain Giulieri
2013 Microprocessors and microsystems  
In this article, we present an approach allowing a hardware/software codesign of applications in which implementation can be chosen at run time depending on available resources.  ...  Reconfigurable hardware offers new ways of accelerating computing by implementing hardware accelerators at run time.  ...  In our approach, the application is only a set of kernels, with no precision on their implementations.  ... 
doi:10.1016/j.micpro.2011.12.007 fatcat:bnvnv5dhongzbaxamkl5tjvezm

A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors [chapter]

Kai Zhang, ShuMing Chen, Wei Liu, Xi Ning
2013 Lecture Notes in Computer Science  
This paper proposes a fine-grained pipelined implementation of LU decomposition on SIMD processors.  ...  Experimental results show that the proposed technology can achieve a speedup of 1.04x to 1.82x over the native algorithm and can achieve about 89% of the peak performance on the SIMD processor.  ...  On the CPU/GPU hybrid architecture, the performance can reach up to 1000 GFLOPS for single precision and 500 GFLOPS for double precision (1 GFLOPS = 10^9 floating-point operations per second).  ... 
doi:10.1007/978-3-642-40820-5_4 fatcat:ysyroxx5abgixozjkpj5s7wyam

State-of-the-art in Heterogeneous Computing

Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, Olaf O. Storaasli
2010 Scientific Programming  
Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.  ...  Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or  ...  Assuming a similar clock frequency as the C1060, one can expect roughly double the performance in single precision, and a dramatic improvement in double precision performance.  ... 
doi:10.1155/2010/540159 fatcat:xu4n5ubgfzh3bobd445cmg7qyu

Cloud benchmarking in bare-metal, virtualized, and containerized execution environments

Soheil Mazaheri, Yong Chen, Elham Hojati, Alan Sill
2016 2016 4th International Conference on Cloud Computing and Intelligence Systems (CCIS)  
It is a portable implementation of the High-Performance LINPACK Benchmark for Distributed- HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic  ...  The benchmark is discussed in more detail in the LINPACK section [33]. HPL (High Performance LINPACK): It is a portable implementation of LINPACK.  ... 
doi:10.1109/ccis.2016.7790286 dblp:conf/ccis/MazaheriCHS16 fatcat:5el3ngbgarcfzgs7duxcdrq2xu

Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study

Kaixi Hou, Hao Wang, Wu-chun Feng
2014 2014 43rd International Conference on Parallel Processing Workshops  
We also observe that the identically optimized code on MIC can outperform its CPU counterpart by up to 3.2-fold.  ...  When compared with the default OpenMP Floyd-Warshall parallel implementation, we still achieve a 6.4-fold speedup.  ...  [4] optimize the Linpack benchmark on Intel Xeon Phi using both of the native and hybrid implementation based on a dynamic scheduling scheme. Wende et al.  ... 
doi:10.1109/icppw.2014.44 dblp:conf/icppw/HouWF14 fatcat:5rn33ekzvncyrpubcrokyy2kpa

A hybrid architecture for bioinformatics

Bertil Schmidt, Heiko Schröder, Manfred Schimmler
2002 Future generations computer systems  
For a successful implementation on a multi-device hybrid-parallel architecture, the exact CPU and PCI Express topologies have to therefore be examined.  ...  For double precision, the value should be aligned to 64 bit boundaries. For the cost of 32 wasted padding bits, the double precision score can be properly aligned.  ... 
doi:10.1016/s0167-739x(02)00058-4 fatcat:ktrtocmzh5hkzgdprusjxddmbi

Best Practice Guide - Modern Accelerators

João Bispo, Jorge G. Barbosa, Pedro Filipe Silva, Cristian Morales, Mirko Myllykoski, Pedro Ojeda-May, Milosz Bialczak, Mariusz Uchronski, Adam Wlodarczyk, Peter Wauligmann, Ezhilmathi Krishnasamy, Sebastien Varrette (+2 others)
2021 Zenodo  
One of such is the offered greater computational throughput as compared to stand-alone Central Processing Units (CPUs), which is driven by the highly parallel architectural design of accelerators.  ...  In fact, this is one of the main reasons that the current Top500 list continues to be enriched with various accelerated systems.  ...  Software structure on SX-Aurora, inspired by [81]. Figure 27. Three possible execution modes: native, hybrid and VH offloading, inspired by [81]. // CUDA kernel.  ... 
doi:10.5281/zenodo.5839488 fatcat:w4k7sdcwlbbabpwavwhawkhdxm

High-performance computing systems: Status and outlook

J. J. Dongarra, A. J. van der Steen
2012 Acta Numerica  
, which has been one of their most remarkable characteristics.  ...  This article describes the current state of the art of high-performance computing systems, and attempts to shed light on near-future developments that might prolong the steady growth in speed of such systems  ...  The other hybrid design that has found some favour is one based on a linking between a commodity CPU and a graphical processing unit (GPU) accelerator.  ... 
doi:10.1017/s0962492912000050 fatcat:n6yodkox5zb6xmlep6gvayud2m

Multi Objective Optimization of HPC Kernels for Performance, Power, and Energy [chapter]

Prasanna Balaprakash, Ananta Tiwari, Stefan M. Wild
2014 Lecture Notes in Computer Science  
Each core offers four-way simultaneous multithreading (SMT) and 512-bit-wide SIMD vectors, which corresponds to 8 double-precision or 16 single-precision floating-point numbers.  ...  Kodi, A., Louri, A.: Performance adaptive power-aware reconfigurable optical interconnects for high-performance computing (HPC) systems.  ... 
doi:10.1007/978-3-319-10214-6_12 fatcat:y2vttotb25g27k53r5ft4jiwgy

Reconfigurable computing for large-scale graph traversal algorithms

Brahim Betkaoui, Wayne Luk
Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.  ...  A summary of our four contributions is as follows. First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms.  ...  The performance of the HC-1 is about the same as a single precision GPU implementation, and is twice as fast as the double precision version.  ... 
doi:10.25560/25049 fatcat:whupl3nn2ndinoseorv7y5c2iy

Eurolab-4-HPC Long-Term Vision on High-Performance Computing [article]

Theo Ungerer, Paul Carpenter
2018 arXiv   pre-print
This document presents the "EuroLab-4-HPC Long-Term Vision on High-Performance Computing" of August 2017, a road mapping effort within the EC CSA1 Eurolab-4-HPC that targets potential changes in hardware  ...  The proposal on research topics is derived from the report and discussions within the road mapping expert group.  ...  Recent work on low-precision implementations of backprop-based neural nets [25] suggests that between 8 and 16 bits of precision can suffice for using or training DNNs with backpropagation.  ... 
arXiv:1807.04521v1 fatcat:5neetrgubjhnvcajcktpkohrzq