Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

P.H. Wang, Hong Wang, J.D. Collins, E. Grochowski, R.M. Kling, J.P. Shen
Proceedings Eighth International Symposium on High Performance Computer Architecture  
The performance of in-order execution Itanium™ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors.  ...  For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (  ...  Acknowledgment The authors would like to thank Justin Rattner and Dean Tullsen for their support, and the anonymous reviewers for their comments.  ... 
doi:10.1109/hpca.2002.995709 dblp:conf/hpca/WangWCGKS02 fatcat:ecs45tjblvfrboknekecphqk3e

Post-pass binary adaptation for software-based speculative precomputation

Steve S.W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen
2002 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02  
The tool is based on the speculative precomputation (SP) paradigm for future Itanium™ processors [16].  ...  out-of-order processor.  ...  We appreciate the helpful suggestions from the referees for this conference.  ... 
doi:10.1145/512529.512544 dblp:conf/pldi/LiaoWWSHL02 fatcat:hidw6rtajfd5vffdg2havbqi4e

Post-pass binary adaptation for software-based speculative precomputation

Steve S.W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen
2002 SIGPLAN notices  
The tool is based on the speculative precomputation (SP) paradigm for future Itanium™ processors [16].  ...  out-of-order processor.  ...  We appreciate the helpful suggestions from the referees for this conference.  ... 
doi:10.1145/543552.512544 fatcat:5mi2fvfb3bcinfnxcbu3gglhue

Post-pass binary adaptation for software-based speculative precomputation

Steve S.W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen
2002 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02  
The tool is based on the speculative precomputation (SP) paradigm for future Itanium™ processors [16].  ...  out-of-order processor.  ...  We appreciate the helpful suggestions from the referees for this conference.  ... 
doi:10.1145/512541.512544 fatcat:nshqi5hd4rh2jbrrzpodutdw5y

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures [article]

Varun Nagpal
2016 arXiv   pre-print
Since the gap between processor and memory performance continues to exist, difficulty to hide and decrease this gap is one of the important factors which results in poor performance of these applications  ...  Such applications have very little computation and unpredictable memory access patterns making them memory-bound in contrast to compute-bound applications.  ...  Multiple instructions can be issued and/or executed either in-order or out-of-order (O-o-O).  ... 
arXiv:1603.02655v1 fatcat:nklt3op66vdfdpmd3ygeckhwla

Data prefetching by dependence graph precomputation

Murali Annavaram, Jignesh M. Patel, Edward S. Davidson
2001 Proceedings of the 28th annual international symposium on Computer architecture - ISCA '01  
A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching.  ...  Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies.  ...  We would like to thank Josef Burger for providing us a version of SHORE that runs on Alpha machines, and Steve Reinhardt for his suggestions and for graciously allowing us to use his Alpha machines.  ... 
doi:10.1145/379240.379251 dblp:conf/isca/AnnavaramPD01 fatcat:y5fh24ntzbhd7kuucz52ltpqeu

Data prefetching by dependence graph precomputation

Murali Annavaram, Jignesh M. Patel, Edward S. Davidson
2001 SIGARCH Computer Architecture News  
A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching.  ...  Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies.  ...  We would like to thank Josef Burger for providing us a version of SHORE that runs on Alpha machines, and Steve Reinhardt for his suggestions and for graciously allowing us to use his Alpha machines.  ... 
doi:10.1145/384285.379251 fatcat:bliermeegbfmvcmad4nnyvj4sa

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Performance Evaluation Review  
This algorithm has been implemented, and evaluated for the Itanium Processor Family architecture.  ...  A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over nextline and streaming I-prefetch for a set of benchmarks with high Icache miss rates  ...  ACKNOWLEDGEMENTS We would like to thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for  ... 
doi:10.1145/885651.781030 fatcat:jnvmih7mifb7xpyv4lkukzym4e

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '03  
This algorithm has been implemented, and evaluated for the Itanium Processor Family architecture.  ...  A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over nextline and streaming I-prefetch for a set of benchmarks with high Icache miss rates  ...  ACKNOWLEDGEMENTS We would like to thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for  ... 
doi:10.1145/781027.781030 dblp:conf/sigmetrics/AamodtMCGHWS03 fatcat:rm5fipdqtjcxrfy5z3geuo6dzy

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt, Pedro Marcuello, Paul Chow, Antonio González, Per Hammarlund, Hong Wang, John P. Shen
2003 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '03  
This algorithm has been implemented, and evaluated for the Itanium Processor Family architecture.  ...  A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over nextline and streaming I-prefetch for a set of benchmarks with high Icache miss rates  ...  ACKNOWLEDGEMENTS We would like to thank Murali Annavaram, Bob Colwell, Edward Grochowski, Steve (Shih-wei) Liao, James Psota, Ronny Ronen, Lesley Shannon, Perry Wang, Craig Zilles, and the anonymous referees for  ... 
doi:10.1145/781028.781030 fatcat:abr2ae6q3zdijhn7jgsx6lfzw4

Optimal Global Instruction Scheduling for the Itanium® Processor Architecture [article]

Sebastian Winkel, Universität des Saarlandes
2005
It can be tolerated better by out-of-order designs, which can rearrange the schedule on a cache miss at runtime (although with limitations, as described above).  ...  Technically similar instructions that execute on the same units with the same latency on Itanium processors are arranged in groups.  ...  Table 7.3: Results of the optimization: Used speculation.  ... 
doi:10.22028/d291-25795 fatcat:bdksgovmnjgjpkoui5axfjkxja

Predictive analysis and optimisation of pipelined wavefront computations

G.R. Mudalige, S.D. Hammond, J.A. Smith, S.A. Jarvis
2009 2009 IEEE International Symposium on Parallel & Distributed Processing  
In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing  ...  Daniel Spooner for acting as my second supervisor, particularly for his advice during the early years of my degree.  ...  Other examples for ILP methods are superscalar execution, where multiple execution units are used to process instructions in parallel, out-of-order execution of instructions, speculative execution, in  ... 
doi:10.1109/ipdps.2009.5160882 dblp:conf/ipps/MudaligeHSJ09 fatcat:gddundkjjjcvzjwqhfjywn3rzm

Hardware Support for Prescient Instruction Prefetch

T.M. Aamodt, P. Chow, P. Hammarlund, Hong Wang, J.P. Shen
10th International Symposium on High Performance Computer Architecture (HPCA'04)  
On a research Itanium® SMT processor with next line and streaming I-prefetch mechanisms that incurs latencies representative of next generation processors, prescient instruction prefetch can improve performance  ...  We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses.  ...  Tor Aamodt and Paul Chow were partly supported by funding from the Natural Sciences and Engineering Research Council of Canada.  ... 
doi:10.1109/hpca.2004.10028 dblp:conf/hpca/AamodtCHWS04 fatcat:rw4ez27t65hbpbwazo5sh57vse

Hints and Principles for Computer System Design [article]

Butler Lampson
2021 arXiv   pre-print
It also gives some principles for system design that are more than just hints, and many examples of how to apply the ideas.  ...  This new long version of my 1983 paper suggests the goals you might have for your system -- Simple, Timely, Efficient, Adaptable, Dependable, Yummy (STEADY) -- and techniques for achieving them -- Approximate  ...  The reasons are to do less total work (a form of speculation) or to reduce latency.  ... 
arXiv:2011.02455v3 fatcat:jolyz5lknjdbpjpxjcrx5rh6fa

P-Ray: A Software Suite for Multi-core Architecture Characterization [chapter]

Alexandre X. Duchateau, Albert Sidelnik, María Jesús Garzarán, David Padua
2008 Lecture Notes in Computer Science  
Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer.  ...  One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance.  ...  Chen Ding suggested the lock switching scheme in the memory controller component; Brian Meeker collected some preliminary data in the early stage of this research.  ... 
doi:10.1007/978-3-540-89740-8_13 fatcat:hv2aoouhcve4xc2vlff77k7q4i