The Cray BlackWidow

Dennis Abts, Abdulla Bataineh, Steve Scott, Greg Faanes, Jim Schwarzmeier, Eric Lundberg, Tim Johnson, Mike Bye, Gerald Schwoerer
2007 Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07  
The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives.  ...  Global memory is directly accessible with processor loads and stores and is globally coherent.  ...  Acknowledgments We would like to thank the entire BlackWidow development, applications, and benchmarking teams.  ... 
doi:10.1145/1362622.1362646 dblp:conf/sc/AbtsBSFSLJBS07 fatcat:rryh3wefojaqjov3hmkshdf3na

SQRL

Snehasish Kumar, Arrvindh Shriraman, Vijayalakshmi Srinivasan, Dan Lin, Jordon Phillips
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
Unfortunately, the narrow load/store interfaces of general-purpose processors are not efficient for data structure traversals, leading to wasteful instructions, low memory-level parallelism, and energy  ...  Many recent research proposals have explored energy efficient accelerators that customize hardware for compute kernels while still relying on conventional load/store interfaces for data delivery.  ...  Current load/store queue interfaces are unable to make effective use of available memory bandwidth.  ... 
doi:10.1145/2628071.2628118 dblp:conf/IEEEpact/KumarSS0P14 fatcat:kpikhzfx7zh37io3hofzkce2ii

Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

S. Chatterjee, L. R. Bachega, P. Bergner, K. A. Dockser, J. A. Gunnels, M. Gupta, F. G. Gustavson, C. A. Lapkowski, G. K. Liu, M. Mendell, R. Nair, C. D. Wait (+2 others)
2005 IBM Journal of Research and Development  
We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for  ...  , and delivers a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.  ...  Acknowledgments The Blue Gene/L project has been supported and partially funded by the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Lawrence Livermore  ... 
doi:10.1147/rd.492.0377 fatcat:qqq365366fhxno5bwaleitoika

Automated data cache placement for embedded VLIW ASIPs

Paul Morgan, Richard Taylor, Japheth Hossell, George Bruce, Barry O'Rourke
2005 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '05  
We present a solution that greatly simplifies the creation of targeted caches and automates the process of explicitly allocating individual memory access to caches and banks.  ...  Memory bandwidth issues present a formidable bottleneck to accelerating embedded applications, particularly data bandwidth for multiple-issue VLIW processors.  ...  Thus it is necessary to instantiate and effectively utilize data cache units that allow multiple concurrent accesses to maximize data bandwidth.  ... 
doi:10.1145/1084834.1084849 dblp:conf/codes/MorganTHBO05 fatcat:sv43h42jqrglhlwfwrciltu3dm

Distributed vector architectures

Stefanos Kaxiras
2000 Journal of systems architecture  
With timing simulations we show that a DIVA system with 2 to 8 nodes is up to three times faster than a single node using its local memory as a large cache and can even outperform a hypothetical system  ...  Such integration provides very high memory bandwidth that can be exploited efficiently by vector operations.  ...  Our results also show that: (i) given multiple memory banks and high internal bandwidth, memory latency is not a critical parameter for performance because of the ability of the vector units to tolerate  ... 
doi:10.1016/s1383-7621(00)00003-5 fatcat:vkhklce7xrftvon5j4hhl3libi

Vector architectures

Roger Espasa, Mateo Valero, James E. Smith
1998 Proceedings of the 12th international conference on Supercomputing - ICS '98  
Vector architectures have long been the architecture of choice for building supercomputers. They first appeared in the early seven-  ...  Acknowledgments We would like to thank Francisca Quintana for providing the data on the R10000 instruction and operation numbers.  ...  was a bidirectional load/store the most notable exceptions).  ... 
doi:10.1145/277830.277935 dblp:conf/ics/EspasaVS98 fatcat:w4ha6nojszaqrojvdg6bwytnw4

Missing the memory wall

Ashley Saulsbury, Fong Pong, Andreas Nowatzyk
1996 SIGARCH Computer Architecture News  
We present results from execution-driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity  ...  Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems.  ...  Acknowledgments The authors would like to acknowledge the valuable help, feedback and inspiration they received from Gunes Aybay, Clement Fang, Howard Davidson, Mark Hill, Sally McKee, William Radke, Eugen  ... 
doi:10.1145/232974.232984 fatcat:w5c3hi3725dpdpc76725f5pyqq

Missing the memory wall

Ashley Saulsbury, Fong Pong, Andreas Nowatzyk
1996 Proceedings of the 23rd annual international symposium on Computer architecture - ISCA '96  
We present results from execution-driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity  ...  Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems.  ...  Acknowledgments The authors would like to acknowledge the valuable help, feedback and inspiration they received from Gunes Aybay, Clement Fang, Howard Davidson, Mark Hill, Sally McKee, William Radke, Eugen  ... 
doi:10.1145/232973.232984 dblp:conf/isca/SaulsburyPN96 fatcat:ut72ah2zxzh73onrac3vems5aq

CRIB

Erika Gunadi, Mikko H. Lipasti
2011 Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11  
With the removal of speculative scheduling, various cache optimizations that were not attractive for conventional machines due to the added latency non-determinism can be employed.  ...  Adding cache optimizations further increases the energy saving to 80% and 77% respectively. Assuming that the front-end, clock-tree, and L2 cache consume 50% of total core power  ...  With cache banking, a cache access only takes 0.5 and 1.5 cycles for address generation and the cache access, effectively reducing the load-to-use latency.  ... 
doi:10.1145/2000064.2000068 dblp:conf/isca/GunadiL11 fatcat:4eghzwmddnedbh7nvkezgi422u

Cache Where you Want! Reconciling Predictability and Coherent Caching [article]

Ayoosh Bansal, Jayati Singh, Yifan Hao, Jen-Yang Wen, Renato Mancuso, Marco Caccamo
2019 arXiv   pre-print
Large fluctuations in the latency to access data shared between multiple cores are an important contributor to overall execution-time variability.  ...  The rationale is that by caching data only in the shared cache it is possible to bypass private caches. The access latency to data present in caches becomes independent of its coherence state.  ...  Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.  ... 
arXiv:1909.05349v1 fatcat:j3uly3xdv5fdzdeo2afh2xuglu

A unified vector/scalar floating-point architecture

N. P. Jouppi, J. Bertoni, D. W. Wall
1989 SIGARCH Computer Architecture News  
The long-term goal of WRL is to aid and accelerate the development of high-performance uni- and multi-processors.  ...  Researchers at WRL cooperate closely and move freely among the various levels of system design. This allows us to explore a wide range of tradeoffs to meet system goals.  ...  Acknowledgements The authors wish to thank Michael L. Powell for helpful discussions, his encouragement, and Linpack simulations. Mike Nielsen contributed the graphics transform simulations.  ... 
doi:10.1145/68182.68195 fatcat:kfgtjfkd65e5hckgbntrkcsi3y

A unified vector/scalar floating-point architecture

N. P. Jouppi, J. Bertoni, D. W. Wall
1989 Proceedings of the third international conference on Architectural support for programming languages and operating systems - ASPLOS-III  
The long-term goal of WRL is to aid and accelerate the development of high-performance uni- and multi-processors.  ...  Researchers at WRL cooperate closely and move freely among the various levels of system design. This allows us to explore a wide range of tradeoffs to meet system goals.  ...  Acknowledgements The authors wish to thank Michael L. Powell for helpful discussions, his encouragement, and Linpack simulations. Mike Nielsen contributed the graphics transform simulations.  ... 
doi:10.1145/70082.68195 dblp:conf/asplos/JouppiBW89 fatcat:ngszz37rqfdvdidajsa2scv6o4

DataScalar architectures

Doug Burger, Stefanos Kaxiras, James R. Goodman
1997 Proceedings of the 24th annual international symposium on Computer architecture - ISCA '97  
Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail.  ...  In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional  ...  Vijaykumar, David Wood, and Todd Bezenek for their helpful discussions and comments on drafts of this paper. We also thank Todd Austin, who developed the original SimpleScalar tool set.  ... 
doi:10.1145/264107.264215 dblp:conf/isca/BurgerKG97 fatcat:uqpa6bqnoneopjamcnstnmp3fi

DataScalar architectures

Doug Burger, Stefanos Kaxiras, James R. Goodman
1997 SIGARCH Computer Architecture News  
Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail.  ...  In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional  ...  Vijaykumar, David Wood, and Todd Bezenek for their helpful discussions and comments on drafts of this paper. We also thank Todd Austin, who developed the original SimpleScalar tool set.  ... 
doi:10.1145/384286.264215 fatcat:5fuljppb7ncajjaclnc2an6rue

Time-predictable chip-multiprocessor design

Martin Schoeberl
2010 2010 Conference Record of the Forty Fourth Asilomar Conference on Signals, Systems and Computers  
To provide better scalability, more local memories have to be used. We add a processor-local scratchpad memory and split data caches, which are still time-predictable, to the processor cores.  ...  The proposed chip-multiprocessor (CMP) uses a shared memory with time-division multiple access (TDMA) based memory access scheduling.  ...  a first version of the split-cache design.  ... 
doi:10.1109/acssc.2010.5757923 fatcat:4fklqnckwbekxltbqn63hpn5n4