Performance and Scalability Analysis of Cray X1 Vectorization and Multistreaming Optimization [chapter]

Sadaf Alam, Jeffrey Vetter
2005 Lecture Notes in Computer Science  
Compiler vectorization provides loop level parallelization and uses the vector processing hardware.  ...  In this paper, we analyze overall impact of loop-level compiler optimization on a scientific application called Parallel Ocean Program (POP).  ...  DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S.  ... 
doi:10.1007/11428831_38 fatcat:q2ffopz6rndlpltpksmtv5cpba

Scalability Study of Polymorphic Register Files

Catalin Ciobanu, Georgi Kuzmanov, Georgi Gaydadjiev
2012 2012 15th Euromicro Conference on Digital System Design  
We study the scalability of multi-lane 2D Polymorphic Register Files (PRFs) in terms of clock cycle time, chip area and power consumption.  ...  We assume an implementation which stores data in a 2D array of linearly addressable memory banks, and consider one single-view and four suitable multi-view parallel access schemes which cover all basic  ...  Examples of specialized extensions of General Purpose Processors (GPPs) include Single Instruction Multiple data (SIMD) facilities, exploiting Data Level Parallelism, but also custom hardware support for  ... 
doi:10.1109/dsd.2012.116 dblp:conf/dsd/CiobanuKG12 fatcat:nntc4gruyzbobozrjztsohgmwu

Efficient Multicore Sparse Matrix-Vector Multiplication for FE Electromagnetics

D.M. Fernandez, D. Giannacopoulos, W.J. Gross
2009 IEEE transactions on magnetics  
We present a new sparse representation and a two level partitioning scheme for efficient sparse matrix-vector multiplication on multicore systems, and show results for a set of finite element matrices  ...  Index Terms-Finite element (FE), multicore, parallel computation, sparse matrices, sparse matrix-vector multiplication (SMVM).  ...  ACKNOWLEDGMENT This work was supported in part by the Natural Sciences and Engineering Research Council of Canada.  ... 
doi:10.1109/tmag.2009.2012640 fatcat:2y5dwp5vyzeazb26lldwhsgh4i

Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability

Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, Scott Mahlke
2012 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture  
The Libra accelerator increases SIMD utility by blurring the divide between vector and instruction parallelism to support efficient execution of a wider range of loops, and it increases hardware utilization  ...  Experimental results show that the 32-lane Libra outperforms traditional SIMD accelerators by an average of 1.58x performance improvement due to higher loop coverage with 29% less energy consumption through  ...  This research is supported by Samsung Advanced Institute of Technology and the National Science Foundation under grants CCF-0916689 and CNS-0964478.  ... 
doi:10.1109/micro.2012.17 dblp:conf/micro/ParkPPM12 fatcat:3skvsbe2vbeujmh2ctwoqwmthe

Multicore Acceleration of CG Algorithms Using Blocked-Pipeline-Matching Techniques

David M. Fernandez, Dennis Giannacopoulos, Warren J. Gross
2010 IEEE transactions on magnetics  
We present a new blocked-pipeline-matched sparse representation and show speedup results for the conjugate gradient method by parallelizing the sparse matrix-vector multiplication kernel on multicore systems  ...  To realize the acceleration potential of multicore computing environments computational electromagnetics researchers must address parallel programming paradigms early in application development.  ...  ACKNOWLEDGMENT This work was supported in part by the Natural Sciences and Engineering Research Council of Canada.  ... 
doi:10.1109/tmag.2010.2044023 fatcat:vjfxthmd5rf6nagsbv4zes267y

Separable 2D Convolution with Polymorphic Register Files [chapter]

Cătălin B. Ciobanu, Georgi N. Gaydadjiev
2013 Lecture Notes in Computer Science  
This paper studies the performance of separable 2D convolution on multi-lane Polymorphic Register Files (PRFs).  ...  We compare the throughput of our PRF to the nVidia Tesla C2050 GPU. The results show that even in bandwidth constrained systems, multi-lane PRFs can outperform the GPU for 9 × 9 or larger mask sizes.  ...  Hwu and Nasser Salim Anssari for providing the nVidia Tesla C2050 GPU results. This work was supported by the European Commission in the context of FP7 FASTER project (#287804).  ... 
doi:10.1007/978-3-642-36424-2_27 fatcat:vfkhzugivvgjdgmu6yf25qevaa

Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage

Mounir Bahtat, Said Belkouch, Philippe Elleaume, Philippe Le Gall
2016 EURASIP Journal on Advances in Signal Processing  
The FFT was generated using an instruction-level scheduling heuristic.  ...  It is a modulo-based register-sensitive scheduling algorithm, which is able to compute an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing  ...  Author details 1 LGECOS Lab, ENSA-Marrakech of the Cadi Ayyad University, Marrakech, Morocco. 2 Thales Air Systems, Paris, France.  ... 
doi:10.1186/s13634-016-0336-0 fatcat:xdvzbjkqyje6bojjt22wcwnwce

The Case for Polymorphic Registers in Dataflow Computing

Cătălin Bogdan Ciobanu, Georgi Gaydadjiev, Christian Pilato, Donatella Sciuto
2017 International journal of parallel programming  
Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data.  ...  Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them.  ...  Hwu and Nasser Salim Anssari from the University of Illinois at Urbana-Champaign for assisting us with obtaining the NVIDIA Tesla C2050 2D separable convolution results.  ... 
doi:10.1007/s10766-017-0494-1 fatcat:bcttuesbpbhp7jrtcv5b5kl5hi

Software-based MPEG-2 encoding system with scalable and multithreaded architecture

Ishfaq Ahmad, Dick-Kwong Yeung, Weiguo Zheng, Shehzad Mehmood, Howard J. Siegel
2001 Commercial Applications for High-Performance Computing  
The proposed multithreaded encoder exploits temporal parallelism in MPEG video sequences with small overhead.  ...  The highlights of the proposed work include an algorithm for enhancing the efficiency of motion estimation, speeding up the computation of motion estimation and DCT with Intel's SIMD (Single Instruction  ...  These 64-bit quantities are stored in a 64-bit SIMD register and processed by a single instruction in a data parallel fashion.  ... 
doi:10.1117/12.434875 fatcat:he3avhetanasleh3gfbqu6e2ay

An Exploration of Performance Attributes for Symbolic Modeling of Emerging Processing Devices [chapter]

Sadaf R. Alam, Nikhil Bhatia, Jeffrey S. Vetter
2007 Lecture Notes in Computer Science  
Using our scheme, the performance prediction error rates for a scientific calculation are reduced from over 200% to less than 25%.  ...  Vector, emerging (homogenous and heterogeneous) multi-core and a number of accelerator processing devices potentially offer an order of magnitude speedup for scientific applications that are capable of exploiting  ...  Acknowledgements The submitted manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S.  ... 
doi:10.1007/978-3-540-75444-2_64 fatcat:uvq6q5ze3jhtfftw22hwca7jiy

OUTRIDER

Neal Clayton Crago, Sanjay Jeram Patel
2011 SIGARCH Computer Architecture News  
Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions.  ...  The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture  ...  ACKNOWLEDGEMENTS The authors acknowledge the support of the Semiconductor Research Corporation (SRC).  ... 
doi:10.1145/2024723.2000079 fatcat:2ny5ydqgmffkvglkm2b2v6fxka

OUTRIDER

Neal Clayton Crago, Sanjay Jeram Patel
2011 Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11  
Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions.  ...  The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture  ...  ACKNOWLEDGEMENTS The authors acknowledge the support of the Semiconductor Research Corporation (SRC).  ... 
doi:10.1145/2000064.2000079 dblp:conf/isca/CragoP11 fatcat:w56fto3w4vgoxamabvgcrkb2z4

Characterization of ILP Distribution for NASA NAS Parallel Benchmarks

Abdullah I. Almojel
2004 Journal of King Saud University: Computer and Information Sciences  
The requirements suggest upper limits on the resources needed for efficient processors. In this study, we also examine non-uniformities in the distribution of instruction-level parallelism.  ...  Several nonuniformities in instruction-level parallelism are investigated including variation between benchmark class and by instruction class within benchmark.  ...  benchmark has the least amount of instruction-level parallelism whereas appbt has the most amount of instruction-level parallelism.  ... 
doi:10.1016/s1319-1578(04)80008-9 fatcat:e3llghhrhfd33byjkarr4tccp4

Towards Resiliency Evaluation of Vector Programs

Vishal Chandra Sharma, Ganesh Gopalakrishnan, Sriram Krishnamoorthy
2016 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
., LLVM-level) code generation to handle full and partial vectorization, modern compilers exploit (and explicate in their code-documentation) critical invariants.  ...  how faults affect vector instruction sets.  ...  , and may, in the grand scheme of things, provide the right kind and level of solution.  ... 
doi:10.1109/ipdpsw.2016.187 dblp:conf/ipps/SharmaGK16 fatcat:fcu76rhn7vhdxpqdgyrfxfvusu

The BAGEL assembler generation library

Peter A. Boyle
2009 Computer Physics Communications  
It provides high performance on the QCDOC, BlueGene/L and BlueGene/P parallel computer architectures that are popular in the field of lattice QCD.  ...  The code includes a complete conjugate gradient implementation for the Wilson and Domain Wall fermion actions, making it easy to use for third party codes including the Jefferson Laboratory's CHROMA, UKQCD's  ...  cache memory strategy produces high reuse at the L2 cache level but with only short sequences of contiguous reads.  ... 
doi:10.1016/j.cpc.2009.08.010 fatcat:vg6w2krmmfglzls5qplqbvftlu