
Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs

Jon J. Pimentel, Brent Bohnenstiehl, Bevan M. Baas
2017 IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead.  ...  The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture.  ...  The digit-by-digit and nonrestoring algorithms are chosen for their low area impact, while the Newton-Raphson method is chosen for providing high throughput since the algorithm converges quadratically  ... 
doi:10.1109/tvlsi.2016.2580142 fatcat:zpc5nsuaobbhrhamz7ttyilipq

Tradeoff of FPGA design of floating-point transcendental functions

Daniel M. Munoz, Diego F. Sanchez, Carlos H. Llanos, Mauricio Ayala-Rincon
2009 2009 17th IFIP International Conference on Very Large Scale Integration (VLSI-SoC)  
[12] applied a Newton-Raphson method and [13] a high-radix SRT division algorithm and a binary restoring square root algorithm.  ...  Notice that equations (1b) and (1c) can be computed in parallel. 2) The Newton-Raphson algorithm for division: The Newton-Raphson algorithm has two n-bit inputs, N and D, that satisfy 1 ≤ N  ... 
doi:10.1109/vlsisoc.2009.6041365 fatcat:grf2zemy35av5dpmznzzu2kuei

An FPGA-Based Parallel Accelerator for Matrix Multiplications in the Newton-Raphson Method [chapter]

Xizhen Xu, Sotirios G. Ziavras, Tae-Gyu Chang
2005 Lecture Notes in Computer Science  
The Newton-Raphson (NR) iterative method is often enlisted for solving power flow analysis problems. However, it involves computation-expensive matrix multiplications (MMs).  ...  In this paper we propose an FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Hierarchical Instruction Set Architecture (HISA) to speed up MM within each NR iteration.  ...  The proposed multi-layered H-SIMD machine paired with an appropriate multilayered HISA software approach is effective for the host-FPGA architecture and can be synthetically used to speed up MM in NR iterations  ... 
doi:10.1007/11596356_47 fatcat:sh2kyryywrajvbxh7yoxjrc7se

Performance tuning of N-body codes on modern microprocessors: I. Direct integration with a hermite scheme on x86_64 architecture

Keigo Nitadori, Junichiro Makino, Piet Hut
2006 New Astronomy  
In subsequent papers, we will discuss other variations, including the combinations of N log N codes, single precision implementations, and performance on other microprocessors.  ...  We have succeeded in speeding up this pair-wise force calculation by factors between two and ten, depending on the code and the processor on which the code is run.  ...  Also, we use the fast approximate square root instruction and a Newton-Raphson iteration. This implementation is 88% faster than baseline.  ... 
doi:10.1016/j.newast.2006.07.007 fatcat:hbm5cw3oufaplpw6yfatq5g6za
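
The snippet above mentions pairing a fast approximate square root instruction with a Newton-Raphson iteration. The sketch below illustrates that general idea only; it is not the paper's code. The hardware instruction is replaced by a bit-pattern seed (an assumption, chosen so the example runs anywhere), and the names `rough_rsqrt` and `refine_rsqrt` are hypothetical.

```python
import struct


def rough_rsqrt(x: float) -> float:
    """Crude estimate of 1/sqrt(x) for positive, normal float32 inputs.

    Assumption: this bit-pattern trick stands in for a hardware
    approximate instruction. It roughly halves and negates the exponent.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def refine_rsqrt(x: float, y: float, steps: int = 1) -> float:
    """Apply Newton-Raphson steps to f(y) = 1/y**2 - x (root: 1/sqrt(x)).

    Each step uses only multiplications and one subtraction, and roughly
    doubles the number of correct bits in y.
    """
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


if __name__ == "__main__":
    x = 2.0
    seed = rough_rsqrt(x)
    for steps in range(4):
        # Print the estimate after 0, 1, 2, and 3 refinement steps.
        print(steps, refine_rsqrt(x, seed, steps))
```

The appeal of this pattern is that a cheap low-precision estimate plus one or two refinement steps can be faster than a full-precision square root and division, at the cost of choosing the step count to match the accuracy needed.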

Long operand arithmetic on instruction systolic computer architectures and its application in RSA cryptography [chapter]

Bertil Schmidt, Manfred Schimmler, Heiko Schröder
1998 Lecture Notes in Computer Science  
Instruction systolic arrays have been developed in order to combine the speed and simplicity of systolic arrays with the flexibility of MIMD parallel computer systems.  ...  It is shown how the new arithmetic leads to a high-speed implementation for RSA encryption and decryption.  ...  Division of Long Operands: Division can be efficiently reduced to multiplication and subtraction by using the Newton-Raphson method.  ... 
doi:10.1007/bfb0057948 fatcat:zqnmnpmkcvajlkf6gwuwwp5ttu
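
As a rough illustration of how division reduces to multiplication and subtraction with Newton-Raphson, the sketch below iterates toward the reciprocal of the divisor and then multiplies by the dividend. It uses ordinary floating point rather than the long-operand arithmetic the chapter describes, and the function name `nr_divide`, the linear seed, and the iteration count are assumptions made for this example.

```python
import math


def nr_divide(n: float, d: float, iterations: int = 4) -> float:
    """Approximate n / d for finite d > 0 via a Newton-Raphson reciprocal."""
    if d <= 0.0:
        raise ValueError("this sketch only handles d > 0")

    # Scale the divisor: d == m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(d)

    # Linear seed for 1/m; on [0.5, 1) its relative error is at most 1/17.
    # The two constants would be precomputed in a real implementation.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m

    # Newton-Raphson on f(x) = 1/x - m: x <- x * (2 - m * x).
    # Only multiplication and subtraction appear; the error squares each step.
    for _ in range(iterations):
        x = x * (2.0 - m * x)

    # n / d == n * (1/m) * 2**(-e)
    return math.ldexp(n * x, -e)


if __name__ == "__main__":
    print(nr_divide(355.0, 113.0))
```

Because the error squares on every pass, four iterations from a seed with error 1/17 are enough to reach double precision here; a multi-word implementation would instead grow the working precision as the iteration proceeds.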

A Preliminary Study of Neural Network-based Approximation for HPC Applications [article]

Wenqian Dong, Anzheng Guolu, Dong Li
2018 arXiv   pre-print
Using two applications (the Newton-Raphson method and the Lennard-Jones (LJ) potential in LAMMP) for our case study, we achieve up to 2.7x and 2.46x speedup, respectively.  ...  Machine learning, as a tool to learn and model complicated (non)linear relationships between input and output data sets, has shown preliminary success in some HPC problems.  ...  Newton-Raphson method: We use NNs of different topologies to replace the Newton-Raphson method. We then study the accuracy and efficiency of the new Newton-Raphson method.  ... 
arXiv:1812.07561v1 fatcat:rk3mvjblenethja2t4qcqfumem

Simulation Acceleration of Image Filtering on CMOS Vision Chips Using Many-Core Processors

Gines Domenech-Asensi, Tom J. Kazmierski
2019 2019 Forum for Specification and Design Languages (FDL)  
Although the integration step is smaller than that required by traditional simulation methods based on Newton-Raphson iterations, explicit methods do not require complex calculations such  ...  The proposed technique has been implemented on an NVIDIA GPU and demonstrated by simulating Gaussian filtering operations performed by a CMOS vision chip.  ...  Regarding general-purpose GPUs, the CUDA programming model defines GPUs as computing devices with their own memory, able to run many threads in parallel.  ... 
doi:10.1109/fdl.2019.8876903 dblp:conf/fdl/Domenech-Asensi19 fatcat:op2hyg2uxzdala4nhbbux3icla

Some parallel methods for polynomial root-finding

A.J. Maeder, S.A. Wynton
1987 Journal of Computational and Applied Mathematics  
Some techniques for parallelizing such methods are identified and some examples are given.  ...  Parallelizations of various different methods for determining the roots of a polynomial are discussed. These include methods which locate a single root only as well as those which find all roots.  ...  secant and parallel Newton-Raphson methods.  ... 
doi:10.1016/0377-0427(87)90056-2 fatcat:z46g2nz6grarbehb7blaepsady

Controlling Optimization Software Packages with the Application of Parallel Computing

Samir Z. Guliyev
2018 Azerbaijan Journal of High Performance Computing  
The paper is devoted to the analysis of techniques and algorithms for controlling the computational process of solving complex optimization problems with the use of multiprocessor and/or multicore computer  ...  We have developed automatic and dialog systems of control of an optimization process.  ...  [Results table residue: rows 6 and 7 both list Newton-Raphson (second-order method); the numeric entries are garbled in extraction.]  ... 
doi:10.32010/26166127.2018. fatcat:twfdk4k5ejepxbcjo7hrkyrhb4

Energy and Delay Improvement via Decimal Floating Point Units

Hossam A. H. Fahmy, Ramy Raafat, Amira M. Abdel-Majeed, Rodina Samy, Tarek ElDeeb, Yasmin Farouk
2009 2009 19th IEEE Symposium on Computer Arithmetic  
Our Newton-Raphson based divider is over three times faster than the similar design previously reported.  ...  This paper presents new designs for decimal floating point (DFP) addition, multiplication, fused multiply-add, division, and square root.  ...  This work also reports the first hardware implementation of the FMA and the fastest hardware DFP divider using Newton-Raphson iterations.  ... 
doi:10.1109/arith.2009.21 dblp:conf/arith/FahmyRASEF09 fatcat:6klu3s6kgbbuxcbfw73q2tkthm

Numerical methods and computers used in elastohydrodynamic lubrication [chapter]

B.J. Hamrock, J.H. Tripp
1984 Developments in Numerical and Experimental Methods Applied to Tribology  
The highlights of four general approaches (direct, inverse, quasi-inverse, and Newton-Raphson) are sketched.  ...  Advantages and disadvantages of these approaches are presented along with a flow chart showing some of the details of each.  ...  Flow diagram of Newton-Raphson method.  ... 
doi:10.1016/b978-0-408-22164-1.50005-3 fatcat:2majfcg7mbbyfk66z4dqzymbwy

A Square-Root-Free Matrix Decomposition Method for Energy-Efficient Least Square Computation on Embedded Systems

Fengbo Ren, Chenxin Zhang, Liang Liu, Wenyao Xu, Viktor Owall, Dejan Markovic
2014 IEEE Embedded Systems Letters  
However, traditional QR decomposition methods, such as Gram-Schmidt (GS), require high computational complexity and nonlinear operations to achieve high throughput, limiting their usage on resource-limited  ...  Up to 4 and 6.5 times improvement in energy-efficiency and throughput, respectively, can be achieved for small-size problems.  ...  Most existing work implements nonlinear operations using iterative approximation methods, such as Newton-Raphson, with word-length optimization applied [2] - [4] .  ... 
doi:10.1109/les.2014.2350997 fatcat:gqirkfu7xrftvcn6v5yp6exrni

Accelerating a fluvial incision and landscape evolution model with parallelism

Richard Barnes
2019 Geomorphology  
Tips for parallelization and a step-by-step guide to achieving it are given to help others achieve good performance with their own code.  ...  The new algorithm runs 43x faster (70s vs. 3,000s on a 10,000x10,000 input) than the previous state of the art and exhibits sublinear scaling with input size.  ...  Jack DeSlippe and Thorsten Kurth helped with an unused OpenMP implementation at an LBNL KNL Hackathon. Mat Colgrave of the PGI Compiler Group found bugs in both my code and the PGI compiler.  ... 
doi:10.1016/j.geomorph.2019.01.002 fatcat:kqqrw6b3hvcnvc6ozsxcfiavoa

GoDEL: A Multidirectional Dataflow Execution Model for Large-Scale Computing

Abhishek Kulkarni, Michael Lang, Andrew Lumsdaine
2011 2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing  
Implemented with efficiency and programmer productivity as its goals, we describe the syntax and semantics of the GoDEL language and discuss its implementation and runtime.  ...  As the emerging trends in hardware architecture guided by performance, power efficiency and complexity drive us towards massive processor parallelism, there has been a renewed interest in dataflow models  ...  This work was partly performed at the Ultrascale Systems Research Center (USRC), a collaboration between Los Alamos National Laboratory and the New Mexico Consortium (NMC).  ... 
doi:10.1109/dfm.2011.12 fatcat:oohxvm4xlndcrp4qkam2wxhi3e

A Generic Vectorization Scheme and a GPU Kernel for the Phylogenetic Likelihood Library

Fernando Izquierdo-Carrasco, Nikolaos Alachiotis, Simon Berger, Tomas Flouri, Solon P. Pissis, Alexandros Stamatakis
2013 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum
We compare the performance of our GPU implementation for DNA data with a highly optimized x86 version of the PLL that relies on manually tuned AVX intrinsics.  ...  To this end, we are currently developing the Phylogenetic Likelihood Library (PLL) that implements functions to compute and optimize the phylogenetic likelihood score on evolutionary trees.  ...  With respect to future work, we plan to fully integrate the GPU kernel with the PLL and support all models and data types (e.g., protein data and the CAT model of rate heterogeneity).  ... 
doi:10.1109/ipdpsw.2013.103 dblp:conf/ipps/Izquierdo-CarrascoABFPS13 fatcat:shracbwghfg77kfvefvdoxvbp4