A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

Zhichun Zhu, Zhao Zhang
11th International Symposium on High-Performance Computer Architecture  
In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRAM systems in SMT systems, and search for new thread-aware DRAM optimization techniques.  ...  Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new  ...  Acknowledgment: We would like to thank the anonymous referees for their constructive criticism and insightful suggestions which helped us improve the paper.  ... 
doi:10.1109/hpca.2005.2 dblp:conf/hpca/ZhuZ05 fatcat:ekrfepjvanf7peugt52xydq6ga

Memory scheduling for modern microprocessors

Ibrahim Hur, Calvin Lin
2007 ACM Transactions on Computer Systems  
The need to carefully schedule memory operations has increased as memory performance has become increasingly important to overall system performance.  ...  about the delays associated with its scheduling decisions, (2) it provides a mechanism for combining multiple constraints, which is important for increasingly complex DRAM structures, and (3) it allows  ...  We also thank Bill Mark, E Lewis, and the anonymous referees for their valuable comments on previous drafts of this article.  ... 
doi:10.1145/1314299.1314301 fatcat:5bu4kqy2ojbstkgohfljmujq3e

Data forwarding through in-memory precomputation threads

Wessam Hassanein, José Fortes, Rudolf Eigenmann
2004 Proceedings of the 18th annual international conference on Supercomputing - ICS '04  
To evaluate IMPT, we use a cycle-accurate simulation of an aggressive out-of-order processor with accurate simulation of bus and memory contention.  ...  The results show a performance gain of up to 1.47 (1.21 on average) over an aggressive superscalar processor. The average load access latency decreases by up to 55% (32% on average).  ...  For comparison, we also study a slower memory processor. The latency of a memory access by the memory-processor is the same as for ESDRAM [8].  ... 
doi:10.1145/1006209.1006239 dblp:conf/ics/HassaneinGE04 fatcat:cmbng5nw6zbexfz7vcixjxplge

Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture

Lei Jiang, Langshi Chen, Judy Qiu
2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
Lastly, we suggest future work including system auto-tuning tools and graph framework optimizations to fully exploit the potential of KNL for parallel graph processing.  ...  offering the lowest local memory access latency hurts the performance of graph benchmarks that lack NUMA awareness.  ...  For TC, 64 OoO cores with SMT support easily saturate the low DDR4 DRAM bandwidth by generating a large number of cache misses every cycle.  ... 
doi:10.1109/ispass.2018.00033 dblp:conf/ispass/JiangCQ18 fatcat:ohonoijtgvh37gpk63be4jsrqi

Evaluating architecture impact on system energy efficiency

Shijie Yu, Hailong Yang, Rui Wang, Zhongzhi Luan, Depei Qian, Xiaosong Hu
2017 PLoS ONE  
for High Performance Computing (HPC) and datacenter environment hosting tens of thousands of servers.  ...  the energy efficiency significantly; 2) for multithreaded application such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most of the workloads benefit a notable increase  ...  [15] illustrate RaT, an interesting design choice for SMT processor that would influence the way in which future SMT processors balance resource usage between ILP and memory-bound threads.  ... 
doi:10.1371/journal.pone.0188428 pmid:29161317 pmcid:PMC5697812 fatcat:dgvz6oyixfc2bmjjovd47piu5y

DRAM-Level Prefetching for Fully-Buffered DIMM: Design, Performance and Power Saving

Jiang Lin, Hongzhong Zheng, Zhichun Zhu, Zhao Zhang, Howard David
2007 IEEE International Symposium on Performance Analysis of Systems & Software  
We have studied DRAM-level prefetching for the fully buffered DIMM (FB-DIMM) designed for multi-core processors.  ...  We have found that the performance gain comes from the reduction of idle memory latency and the improvement of channel bandwidth utilization.  ...  Acknowledgment We appreciate the constructive comments from the anonymous reviewers and thank Bruce Christenson at Intel for his critical comments.  ... 
doi:10.1109/ispass.2007.363740 dblp:conf/ispass/LinZZZD07 fatcat:ut7blyqgqjgehpjlhat3325dcu

Improving Operational Intensity in Data Bound Markov Chain Monte Carlo

Balazs Nemeth, Tom Haber, Thomas J. Ashby, Wim Lamotte
2017 Procedia Computer Science  
Performance improvements are shown for Bayesian logistic regression with a Markov chain Monte Carlo sampler, either with multiple chains or with multiple proposals, on a dense data set two orders of magnitude  ...  The test system had an Intel E5-2690v2 processor with 10 cores and 32 GB of memory.  ... 
doi:10.1016/j.procs.2017.05.024 fatcat:ihgigk7d3bc6rbrvern6qo5hqu

Architectural optimizations for low-power, real-time speech recognition

Rajeev Krishna, Scott Mahlke, Todd Austin
2003 Proceedings of the international conference on Compilers, architectures and synthesis for embedded systems - CASES '03  
Our results show that a simple, multi-threaded, multi-pipelined processor architecture can significantly improve the performance of the time-consuming search phase of modern speech recognition algorithms  ...  The computational demands of robust, large vocabulary speech recognition systems, however, are currently prohibitive for such low power devices.  ...  The second is a larger, more sophisticated memory bus, interfacing to a standard DRAM memory system.  ... 
doi:10.1145/951710.951740 dblp:conf/cases/KrishnaMA03 fatcat:ku36nrbuhfayjgnztwozzuxtqu

Architectural optimizations for low-power, real-time speech recognition

Rajeev Krishna, Scott Mahlke, Todd Austin
2003 Proceedings of the international conference on Compilers, architectures and synthesis for embedded systems - CASES '03  
Our results show that a simple, multi-threaded, multi-pipelined processor architecture can significantly improve the performance of the time-consuming search phase of modern speech recognition algorithms  ...  The computational demands of robust, large vocabulary speech recognition systems, however, are currently prohibitive for such low power devices.  ...  The second is a larger, more sophisticated memory bus, interfacing to a standard DRAM memory system.  ... 
doi:10.1145/951736.951740 fatcat:y2tf3wi46vcefb444akmmk464u

Integrated Memory Controllers with Parallel Coherence Streams

Mainak Chaudhuri, Mark Heinrich
2007 IEEE Transactions on Parallel and Distributed Systems  
Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive  ...  However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems', we must ask whether  ...  The authors extend special thanks to the Security Center of IIT Kanpur for offering a quad Opteron used to run some of the simulations.  ... 
doi:10.1109/tpds.2007.1044 fatcat:p5yn5rkdmvcw3kqioiqbzo2kha

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Jaejin Lee, Changhee Jung, Daeseob Lim, Yan Solihin
2009 IEEE Transactions on Parallel and Distributed Systems  
Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25.  ...  This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system.  ...  Fig. 2a shows the architecture of a system that integrates the memory processor in the DRAM chips or in the memory module (e.g., DIMM).  ... 
doi:10.1109/tpds.2008.224 fatcat:dekoh4fecrgznpn6py3nnzlz3m

A single-chip multiprocessor

B.A. Nayfeh, K. Olukotun
1997 Computer  
Memory A 12-issue superscalar or SMT processor can place large demands on the memory system.  ...  /SMT memory system.  ... 
doi:10.1109/2.612253 fatcat:l645n6krxnaphalnk5w6pogwye

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study [article]

Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade
2016 arXiv   pre-print
We compare micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server.  ...  to DRAM.  ...  We thank Ananya Muddukrishna for their comments on the first draft of the paper. We also thank the anonymous reviewers for their constructive feedback  ... 
arXiv:1604.08484v1 fatcat:tp3yp5g32nek3d2ndr73vaavf4

Improving Latency Tolerance of Network Processors Through Simultaneous Multithreading [chapter]

Bo Liang, Hong An, Fang Lu, Rui Guo
2005 Lecture Notes in Computer Science  
Multiple PEs, each of which is a multithreaded processor core, process several packets in parallel to hide long memory access latency. Most of them are optimized for throughput, mostly in the data plane.  ...  We show in this paper that 2~4 issue SMT provides excellent short memory and branch latency tolerance, which yields higher instruction throughput as well as much simpler structures.  ...  Test Result: The Memory Latency Hiding Effectiveness of SMT Superscalar Core. For the purpose of comparison, the memory access latency tolerance of the superscalar architecture is firstly investigated  ... 
doi:10.1007/11573937_9 fatcat:arqg3qyuibe45hfp4iajjz6z3y

A comprehensive approach to DRAM power management

Ibrahim Hur, Calvin Lin
2008 High-Performance Computer Architecture  
This paper describes a comprehensive approach for using the memory controller to improve DRAM energy efficiency and manage DRAM power.  ...  We make three contributions: (1) we describe a simple power-down policy for exploiting low power modes of modern DRAMs; (2) we show how the idea of adaptive history-based memory schedulers can be naturally  ...  We thank Alper Buyuktosunoglu for his helpful expertise on power consumption. We thank the entire IBM Power5 team, in particular, Cheryl Chunco, Steve Dodson, Gary Morrison, Stephen J.  ... 
doi:10.1109/hpca.2008.4658648 dblp:conf/hpca/HurL08 fatcat:llxlsmnoqndhnpprsut5dtmuzi