Focused prefetching

R. Manikantan, R. Govindarajan
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
Also we show that the criterion of focusing on commit stalls is robust enough across cache levels and can be applied to any prefetcher without any modifications to the prefetcher.  ...  We propose simple history-based classifiers that track commit stalls suffered by loads to help us identify this small set of loads. We study an application of these classifiers to prefetching.  ...  On an average, the gain in performance for commit stall based focused prefetching over criticality based focused prefetching is 4.6% while Figure 16 : Performance gains of Focused Prefetching over Criticality  ... 
doi:10.1145/1375527.1375576 dblp:conf/ics/ManikantanG08 fatcat:ax4fzxee6jffhgynegcyqvuupe

Targeted Data Prefetching [chapter]

Weng-Fai Wong
2005 Lecture Notes in Computer Science  
Our results show that our prefetch strategy can reduce up to 45% of stall cycles of benchmarks running on a simulated out-of-order superscalar processor with an overhead of 0.0005 prefetch per CPU cycle  ...  The success of any data prefetching scheme depends on three factors: timeliness, accuracy and overhead.  ...  It would be interesting to see if other prediction schemes, perhaps even ones that are uniquely designed for different applications so as to optimize area-performance, say, can benefit from it.  ... 
doi:10.1007/11572961_63 fatcat:qdjnboounngotjfjjx6yww6kde

Kilo-instruction processors, runahead and prefetching

Tanausú Ramírez, Alex Pajuelo, Oliverio J. Santana, Mateo Valero
2006 Proceedings of the 3rd conference on Computing frontiers - CF '06  
Runahead mechanism is another form of prefetching based on speculative execution.  ...  We show that Runahead execution achieves better performance speedups (30% on average) than traditional prefetch techniques (21% on average).  ...  Now, we show the performance when both Runahead execution and the Kilo-instruction processor are enhanced with a stride-based prefetcher.  ... 
doi:10.1145/1128022.1128059 dblp:conf/cf/RamirezPSV06 fatcat:4qur6t4fdra7tntpiiozuth55y

Call-chain Software Instruction Prefetching in J2EE Server Applications

Priya Nagpurkar, Harold W. Cain, Mauricio Serrano, Jong-Deok Choi, Chandra Krintz
2007 Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on  
When running two J2EE benchmarks on WebSphere, we find that instruction cache misses cause a 12% performance penalty on current-generation Power5-based multiprocessor systems.  ...  To mitigate this performance loss, we describe a new call-chain based algorithm for inserting software prefetch instructions, and evaluate its potential for improved instruction cache performance.  ...  Acknowledgments We thank the anonymous reviewers for providing useful comments on this paper. This work was funded in part by IBM Research and NSF grants CCF-0444412 and CNS-0546737.  ... 
doi:10.1109/pact.2007.4336207 fatcat:j2mdpqenlnanhjxdoagmpg7wge

Exploiting the Role of Hardware Prefetchers in Multicore Processors

Hasina Khatoon, Shahid Hafeez, Talat Altaf
2013 International Journal of Advanced Computer Science and Applications  
Another aspect that is investigated is the performance of multicore processors using a multiprogram workload as compared to a single program workload while varying the configuration of the built-in hardware prefetchers.  ...  Manikantan and Govindarajan [24] have proposed performance-oriented prefetching enhancements that include focused prefetching to avoid commit stalls.  ... 
doi:10.14569/ijacsa.2013.040622 fatcat:z2vik33z5rbnjdkuzaxpiptcxu

Designing lab sessions focusing on real processors for computer architecture courses: A practical perspective

Josué Feliu, Julio Sahuquillo, Salvador Petit
2018 Journal of Parallel and Distributed Computing  
This approach is based on performing experiments on current commercial processors, where multiple hardware events related to the performance of the computer components under study are monitored.  ...  Lab sessions are mainly based on simulation frameworks because they benefit learning.  ...  ., committed instructions, cache misses, issue stalls, etc).  ... 
doi:10.1016/j.jpdc.2018.02.026 fatcat:apa6byknfbftddikyqlq226p3a

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '03  
application performance by performing judicious and timely instruction prefetch.  ...  The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics.  ...  thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for their valuable comments on  ... 
doi:10.1145/781027.781030 dblp:conf/sigmetrics/AamodtMCGHWS03 fatcat:rm5fipdqtjcxrfy5z3geuo6dzy

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Performance Evaluation Review  
application performance by performing judicious and timely instruction prefetch.  ...  The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics.  ...  thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for their valuable comments on  ... 
doi:10.1145/885651.781030 fatcat:jnvmih7mifb7xpyv4lkukzym4e

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '03  
application performance by performing judicious and timely instruction prefetch.  ...  The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics.  ...  thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for their valuable comments on  ... 
doi:10.1145/781028.781030 fatcat:abr2ae6q3zdijhn7jgsx6lfzw4

A Customized Processor for Energy Efficient Scientific Computing

Ankit Sethia, Ganesh Dasika, Trevor Mudge, Scott Mahlke
2012 IEEE transactions on computers  
It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cost of a high-end laptop computer.  ...  PEPSC utilizes a combination of a 2D single-instruction multiple-data (SIMD) datapath, an intelligent dynamic prefetching mechanism, and a configurable SIMD control approach to increase execution efficiency  ...  We also thank Gaurav Chadha and Wade Walker for providing feedback on this work. This research was supported by the US National Science Foundation grant CNS-0964478 and ARM Ltd.  ... 
doi:10.1109/tc.2012.144 fatcat:6wb7y7femfftlh5geqsqbm37wy

Clearing the clouds

Michael Ferdman, Babak Falsafi, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki
2012 Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '12  
Processor real-estate and power are misspent on large last-level caches that do not contribute to improved scale-out workload performance.  ...  We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor micro-architecture is inefficient for running these workloads.  ...  We classify each cycle of execution as Committing if at least one instruction was committed during that cycle or as Stalled otherwise.  ... 
doi:10.1145/2150976.2150982 dblp:conf/asplos/FerdmanAKVAJKPAF12 fatcat:z37fymq7dzgzxhnrwjudviuzwi

Clearing the clouds

Michael Ferdman, Babak Falsafi, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki
2012 SIGARCH Computer Architecture News  
Processor real-estate and power are misspent on large last-level caches that do not contribute to improved scale-out workload performance.  ...  We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor micro-architecture is inefficient for running these workloads.  ...  We classify each cycle of execution as Committing if at least one instruction was committed during that cycle or as Stalled otherwise.  ... 
doi:10.1145/2189750.2150982 fatcat:26l7woyutjhodbffqiidze5i2e

Improving memory scheduling via processor-side load criticality information

Saugata Ghose, Hyodong Lee, José F. Martínez
2013 SIGARCH Computer Architecture News  
In this paper we propose one such mechanism, pairing up a processor-side load criticality predictor with a lean memory controller that prioritizes load requests based on ranking information supplied from  ...  Using a sophisticated multicore simulator that includes a detailed quad-channel DDR3 DRAM model, we demonstrate that this mechanism can improve performance significantly on a CMP, with minimal overhead  ...  Based on Subramaniam et al.  ... 
doi:10.1145/2508148.2485930 fatcat:my7cmcjcvjcxbbzd324gvs63f4

Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors

Michael Ferdman, Babak Falsafi, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki
2012 ACM Transactions on Computer Systems  
We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor microarchitecture is inefficient for running these workloads.  ...  We classify each cycle of execution as Committing if at least one instruction was committed during that cycle or as Stalled otherwise.  ...  We present execution-time breakdown results based on the performance counters that have no overlap.  ... 
doi:10.1145/2382553.2382557 fatcat:huy2nlmwibftnbrk32z77noowq

Improving memory scheduling via processor-side load criticality information

Saugata Ghose, Hyodong Lee, José F. Martínez
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
In this paper we propose one such mechanism, pairing up a processor-side load criticality predictor with a lean memory controller that prioritizes load requests based on ranking information supplied from  ...  Using a sophisticated multicore simulator that includes a detailed quad-channel DDR3 DRAM model, we demonstrate that this mechanism can improve performance significantly on a CMP, with minimal overhead  ...  Based on Subramaniam et al.  ... 
doi:10.1145/2485922.2485930 dblp:conf/isca/GhoseLM13 fatcat:qkrlisgjxjf2tm2y343xctlvue