Performance optimization of irregular codes based on the combination of reordering and blocking techniques

J.C. Pichel, D.B. Heras, J.C. Cabaleiro, F.F. Rivera
2005 Parallel Computing  
The product of a sparse matrix by a dense vector (SpM×V) is the code studied on different monoprocessors and distributed memory multiprocessors.  ...  The combination of techniques based on reordering data with classic code restructuring techniques for increasing the locality in the execution of sparse algebra codes is studied in this paper.  ...  vector x, improving the spatial locality in the accesses to that vector.  ... 
doi:10.1016/j.parco.2005.04.012 fatcat:6dy7w24f35hatkpfucvkyaxvwi

Increasing memory bandwidth for vector computations [chapter]

Sally A. McKee, Steven A. Moyer, Wm. A. Wulf, Charles Hitchcock
1994 Lecture Notes in Computer Science  
The traditional scalar processor concern has been to minimize memory latency in order to maximize processor performance.  ...  After defining access ordering, our technique for improving vector memory bandwidth, we describe a hardware Stream Memory Controller (SMC) used to perform access ordering dynamically at run time, and discuss  ...  Reordering can optimize accesses to exploit the underlying memory architecture.  ... 
doi:10.1007/3-540-57840-4_26 fatcat:cs5dt25iffg2jhjtuyyw6a22ji

Three-level hybrid vs. flat MPI on the Earth Simulator: Parallel iterative solvers for finite-element method

Kengo Nakajima
2005 Applied Numerical Mathematics  
Introduction SMP Cluster Architecture and Hybrid Parallel Programming Model Recent technological advances have allowed increasing numbers of processors to have access to a single memory space in a cost-effective  ...  This is achieved by allowing each thread to access data provided by other threads directly by accessing the shared memory instead of using message passing.  ...  memory access and sufficiently innermost long loops.  ... 
doi:10.1016/j.apnum.2004.09.025 fatcat:h5txjs4pynhx5jgr3njxqbwvuq

Increasing data reuse of sparse algebra codes on simultaneous multithreading architectures

J. C. Pichel, D. B. Heras, J. C. Cabaleiro, F. F. Rivera
2009 Concurrency and Computation  
This puts a lot of stress on the memory hierarchy, and a poor locality, both inter-thread and intra-thread, may become a major bottleneck in the performance of a code.  ...  This paper proposes a data reordering technique specially tuned for this kind of architectures and codes. It is based on a locality model developed by the authors in previous works.  ...  Note that the elements of each vector are stored in contiguous memory locations.  ... 
doi:10.1002/cpe.1404 fatcat:2sk2h74y4jabzggxihhewjr3x4

Towards a Fast Parallel Sparse Matrix-Vector Multiplication

Roman Geus, Stefan Röllin
2000 Parallel Computing  
The sparse matrix-vector product is an important computational kernel that runs ineffectively on many computers with super-scalar RISC processors.  ...  In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation.  ...  Because of the smaller bandwidth it is likely that during matrix-vector multiplication vector elements that are accessed in a particular matrix row will be accessed again in the following row.  ... 
doi:10.1142/9781848160170_0036 fatcat:o7dm23gijvfljedo4nsmwbjdxy

Speculative dynamic vectorization for HW/SW co-designed processors

Rakesh Kumar, Alejandro Martínez, Antonio González
2012 Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT '12  
These processors utilize dynamic optimizations to improve the performance. However, vectorization, one of the most potent optimizations, has not yet received the deserved attention.  ...  This paper presents a speculative dynamic vectorization algorithm to explore its potential.  ...  • they access overlapping memory locations, an exception is raised.  ... 
doi:10.1145/2370816.2370895 dblp:conf/IEEEpact/KumarMG12 fatcat:umovrcwwqjdt5hajzgsoadnedm

Reordering Algorithms for Increasing Locality on Multicore Processors

Juan C. Pichel, David E. Singh, Jesús Carretero
2008 2008 10th IEEE International Conference on High Performance Computing and Communications  
In order to efficiently exploit available parallelism, multicore processors must address contention for shared resources such as the cache hierarchy.  ...  Likewise, a comparison of our proposal with some standard reordering techniques is included in the paper.  ...  Note that elements of each vector are stored in contiguous memory locations.  ... 
doi:10.1109/hpcc.2008.96 dblp:conf/hpcc/PichelSC08 fatcat:thxmxf73orcndpadu5gjbrutnm

Increasing the Locality of Iterative Methods and Its Application to the Simulation of Semiconductor Devices

J.C. Pichel, D.B. Heras, J.C. Cabaleiro, A.J. García-Loureiro, F.F. Rivera
2009 The international journal of high performance computing applications  
The main kernel of the iterative methods is the sparse matrix-vector multiplication which frequently demands irregular data accesses.  ...  Noticeable reductions in the execution time required by the simulations are observed when using our reordered matrices in comparison with the original simulator.  ...  Introduction As a result of the important increase in the performance of the processors over the years, the gap between processors and memory has been widened.  ... 
doi:10.1177/1094342009106416 fatcat:ilvouiq345by3ioo4245ircqy4

Efficient Parallel Nonnegative Least Squares on Multicore Architectures

Yuancheng Luo, Ramani Duraiswami
2011 SIAM Journal on Scientific Computing  
We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs.  ...  The underlying parameters of the model form a set of n variables in an n × 1 vector. Suppose that the observed data are linear functions of the underlying parameters in the model, then the function's values  ...  It is also the slowest memory to access as a single query from a multi-processor has a 400 to 600 clock cycles latency on a cache-miss.  ... 
doi:10.1137/100799083 fatcat:72w2p7urdjej5owy6ao72wqiwm

Hardware-only stream prefetching and dynamic access ordering

Chengqiang Zhang, Sally A. McKee
2000 Proceedings of the 14th international conference on Supercomputing - ICS '00  
Many researchers have studied hardware prefetching in its various forms. Others have examined dynamic memory scheduling to help bridge the performance gap between processors and DRAM memory systems.  ...  stream data, and the latencies can be reduced by reordering stream accesses to exploit parallelism and locality within the DRAMs.  ...  ACKNOWLEDGMENTS This research was sponsored in part by National Science Foundation award 9806043. The authors thank Steve Reinhardt and Wei-fen Lin for providing the initial Rambus model.  ... 
doi:10.1145/335231.335247 dblp:conf/ics/ZhangM00 fatcat:ipbfsbscoff3rdfydk7swwgisq

Conflict-Free Strides for Vectors in Matched Memories

Mateo Valero, Tomás Lang, José M. Llabería, Montse Peiron, Juan J. Navarro, Eduard Ayguadé
1991 Parallel Processing Letters  
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free access to one family of strides in vector processors with matched memories.  ...  The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor.  ...  Introduction To have a sufficient memory bandwidth, the memory of vector processors is organized as several modules that can be accessed simultaneously.  ... 
doi:10.1142/s0129626491000045 fatcat:fopzu42uqjeiroaqvvvauke5py

Reordering Memory Bus Transactions for Reduced Power Consumption [chapter]

Bruce R. Childers, Tarun Nakra
2001 Lecture Notes in Computer Science  
Using MPOWER, we measured the effectiveness of reordering memory accesses on switching activity.  ...  In this paper, we study the impact of reordering memory bus traffic on reducing bus switching activity and power consumption.  ...  In this scheme, a reordering vector is associated with every cache line that indicates the order in which individual cache line words should be accessed.  ... 
doi:10.1007/3-540-45245-1_10 fatcat:ndcksq4o7fhadgglbc5d2botva

Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment

Rakesh Kumar, Alejandro Martinez, Antonio Gonzalez
2013 20th Annual International Conference on High Performance Computing  
However, compilers' inability to reorder ambiguous memory references severely limits vectorization opportunities, especially in pointer rich applications.  ...  We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities.  ...  • they access overlapping memory locations, an exception is raised.  ... 
doi:10.1109/hipc.2013.6799102 dblp:conf/hipc/KumarMG13 fatcat:on5q3kwpnnbi7kfhxptf4fm4xm

Design and implementation of a parallel unstructured Euler solver using software primitives

R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, R. Ponnysamy
1994 AIAA Journal  
The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.  ...  This paper is concerned with the implementation of a three-dimensional unstructured-grid Euler-solver on massively parallel distributed-memory computer architectures.  ...  The resulting data access pattern after RCM reordering is shown in Figure 3.  ... 
doi:10.2514/3.12012 fatcat:ra2dv3yc7rgkzel54y4k6lokfy

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices

Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mostofa Ali Patwary, Yutong Lu, Pradeep Dubey
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering.  ...  Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications.  ...  Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.  ... 
doi:10.1109/sc.2014.82 dblp:conf/sc/ParkSVHKLPLD14 fatcat:ktiisywie5hhznon5qc2tuydoa