PERFORMANCE STUDY OF LU FACTORIZATION WITH LOW COMMUNICATION OVERHEAD ON MULTIPROCESSORS

F. DESPREZ, J. J. DONGARRA, B. TOURANCHEAU
1995 Parallel Processing Letters  
In this paper, we make efficient use of asynchronous communications on the LU decomposition algorithm with pivoting and a column-scattered data decomposition to derive precise computational complexities.  ...  We then compare these results with experiments on the Intel iPSC/860 and Paragon machines and show that very good performances can be obtained on a ring with asynchronous communications.  ...  Introduction This paper presents an analytical estimation of the LU factorization algorithm on a distributed-memory message-passing multiprocessor.  ... 
doi:10.1142/s012962649500014x fatcat:patgufmkkjhuth73moe45fikeq

Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory

S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D.J. Scales, M.L. Scott, R. Stets
1999 Proceedings Fifth International Symposium on High-Performance Computer Architecture  
the advent of higher performance systems.  ...  We do work in the design, fabrication and packaging of hardware; language processing and scaling issues in system software design; and the exploration of new applications areas that are opening up with  ...  Shasta's performance on Ilink is affected by three factors: the checking overhead, the small communication granularity, and the use of an eager protocol.  ... 
doi:10.1109/hpca.1999.744377 dblp:conf/hpca/DwarkadasGKSSS99 fatcat:3ngvqrvmofh2jmv3twk7ujtvwa

Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors

Matteo Monchiero, Gianluca Palermo, Cristina Silvano, Oreste Villa
2006 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation  
In this paper, a distributed shared memory architecture has been explored, that is suitable for low-power on-chip multiprocessors based on NoC.  ...  Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective  ...  Interconnection The interconnect is the key element of the multiprocessor system, since it provides low latency communication layer, capable of minimizing the overhead due to thread spawning and synchronization  ... 
doi:10.1109/icsamos.2006.300821 dblp:conf/samos/MonchieroPSV06 fatcat:cu6537637na4vgk3bfthdjxuoe

Exploration of distributed shared memory architectures for NoC-based multiprocessors

Matteo Monchiero, Gianluca Palermo, Cristina Silvano, Oreste Villa
2007 Journal of systems architecture  
In this paper, a distributed shared memory architecture has been explored, that is suitable for low-power on-chip multiprocessors based on NoC.  ...  Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective  ...  Interconnection The interconnect is the key element of the multiprocessor system, since it provides low latency communication layer, capable of minimizing the overhead due to thread spawning and synchronization  ... 
doi:10.1016/j.sysarc.2007.01.008 fatcat:6jjvd42x2vetdmai3ftipxlg5e

Shared memory computing on clusters with symmetric multiprocessors and system area networks

Leonidas Kontothanassis, Robert Stets, Galen Hunt, Umit Rencuzogullari, Gautam Altekar, Sandhya Dwarkadas, Michael L. Scott
2005 ACM Transactions on Computer Systems  
Experiments indicate that a one-level version of the Cashmere protocol provides performance comparable to, or slightly better than, that of TreadMarks' lazy release consistency.  ...  Kontothanassis et al. improves overall performance when care is taken to avoid interference with inter-node software coherence.  ...  The Shasta results reported in Section 4.2 were obtained with the generous assistance of Dan Scales and Kourosh Gharachorloo. The authors would like to thank Ricardo Bianchini and Alan L.  ... 
doi:10.1145/1082469.1082472 fatcat:itz3q5b2fbczhcgzkoszmabr5a

Integrating performance monitoring and communication in parallel computers

Margaret Martonosi, David Ofelt, Mark Heinrich
1996 Performance Evaluation Review  
We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.  ...  In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring.  ...  FlashPoint obtains detailed memory performance statistics at low overheads with good accuracy.  ... 
doi:10.1145/233008.233035 fatcat:n5jfpt3gvbhz7bmtxlkmxepk54

Integrating performance monitoring and communication in parallel computers

Margaret Martonosi, David Ofelt, Mark Heinrich
1996 Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '96  
We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.  ...  In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring.  ...  FlashPoint obtains detailed memory performance statistics at low overheads with good accuracy.  ... 
doi:10.1145/233013.233035 dblp:conf/sigmetrics/MartonosiOH96 fatcat:wurqfoveibdavck7kyosipplom

Disk caching with an optical ring

Enrique V. Carrera, Ricardo Bianchini
2000 Applied Optics  
Even though our study focuses on optimizing page swap outs, we believe that caching data with an optical ring can be beneficial for other types of disk-write traffic as well.  ...  To evaluate the extent to which these benefits affect performance, we use detailed execution-driven simulations of several out-of-core parallel applications that run on an eight-node scalable multiprocessor  ...  We would also like to thank Timothy Pinkston, Joon-Ho Ha, and Fredrik Dahlgren for their careful evaluation of our study and for discussions that helped improve this paper significantly.  ... 
doi:10.1364/ao.39.006663 pmid:18354681 fatcat:5k2pwihkpndvdalmzzavbafl2a

Dynamic program phase detection in distributed shared-memory multiprocessors

E. Ipek, J.F. Martinez, B.R. de Supinski, S.A. McKee, M. Schulz
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
We then propose a hardware extension to a well-known uniprocessor mechanism that significantly improves phase detection in the context of DSM multiprocessors.  ...  We present a novel hardware mechanism for dynamic program phase detection in distributed shared-memory (DSM) multiprocessors.  ...  degrades quickly with the system size. • Propose and evaluate a low-overhead architectural mechanism that captures data distribution, latency, and contention effects of a DSM multiprocessor setting.  ... 
doi:10.1109/ipdps.2006.1639572 dblp:conf/ipps/IpekMSMS06 fatcat:blchagrmovc4pgmd3lmzvpeoxe

An efficient synchronization technique for multiprocessor systems on-chip

Matteo Monchiero, Gianluca Palermo, Cristina Silvano, Oreste Villa
2006 SIGARCH Computer Architecture News  
For an 8-processor target architecture, we show that the proposed solution achieves up to 40% performance improvement and 25% energy saving with respect to synchronization based on the caching of the synchronization  ...  We suggest the architecture of the memory controller optimized to minimize synchronization overhead.  ...  The interconnect is a key element of the system, since it provides low latency communication layer, capable of minimizing the overhead due to thread spawning and synchronization.  ... 
doi:10.1145/1147349.1147357 fatcat:glxyh2x3qbbodplpgu5icukz6q

SHARED MEMORY VERSUS MESSAGE PASSING FOR ITERATIVE SOLUTION OF SPARSE, IRREGULAR PROBLEMS

FREDERIC T. CHONG, ANANT AGARWAL
1999 Parallel Processing Letters  
The benefits of hardware support for shared memory versus those for message passing are difficult to evaluate without an in-depth study of real applications on a common platform.  ...  We evaluate the communication mechanisms of the MIT Alewife machine, a multiprocessor which provides integrated cache-coherent shared memory, message passing, and DMA.  ...  Aggregation has been a common approach on traditional multiprocessors with high-overhead communication.  ... 
doi:10.1142/s0129626499000177 fatcat:ja24n7c6w5ghjk3jfiq3dqmzgq

A quantitative analysis of the performance and scalability of distributed shared memory cache coherence protocols

M. Heinrich, V. Soundararajan, J. Hennessy, A. Gupta
1999 IEEE transactions on computers  
Although the performance of such multiprocessors depends critically on the performance of the cache coherence protocol, little comparative performance data is available.  ...  In addition to measurements of the characteristics of protocol execution (e.g. memory overhead, protocol execution time, and message count) and of overall performance, we examine the effects of scaling  ...  The authors wish to thank the FLASH team members as well as Robert Bosch for his tireless support of the simulation environment.  ... 
doi:10.1109/12.752662 fatcat:kforuwbdtbfmnarn7uqmivan2a

Limits on the performance benefits of multithreading and prefetching

Beng-Hong Lim, Ricardo Bianchini
1996 Performance Evaluation Review  
This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor.  ...  The model also shows that multithreading can significantly improve the performance of the same applications in multiprocessors with longer latencies.  ...  Acknowledgments We would like to thank the members of the Alewife group, es-  ... 
doi:10.1145/233008.233021 fatcat:lfpb7yymwreyvjtr2twlizkapa

Limits on the performance benefits of multithreading and prefetching

Beng-Hong Lim, Ricardo Bianchini
1996 Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '96  
This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor.  ...  The model also shows that multithreading can significantly improve the performance of the same applications in multiprocessors with longer latencies.  ...  Acknowledgments We would like to thank the members of the Alewife group, es-  ... 
doi:10.1145/233013.233021 dblp:conf/sigmetrics/LimB96 fatcat:dtf7p3jlnfffxa4cswtqhpksbu

Implementing Dense Linear Algebra Algorithms Using Multitasking on the CRAY X-MP-4 (or Approaching the Gigaflop)

Jack J. Dongarra, Tom Hewitt
1986 SIAM Journal on Scientific and Statistical Computing  
The editors are pleased to launch this section with a note on the use of a new computer organization that is likely to be the start of a revolution in scientific and statistical computation.  ...  This is the first paper to be published under the new "timely communications" policy for the SIAM Journal on Scientific and Statistical Computing.  ...  The versions of LU and Cholesky factorization, used here, are based on matrix-vector modules that allow for a high level of granularity, permitting high performance in a number of different environments  ... 
doi:10.1137/0907023 fatcat:t7qr47nvwjdjroggprc7pop2km