Software Data Prefetching for Software Pipelined Loops

Jesús Sánchez, Antonio González
1999 Journal of Parallel and Distributed Computing  
This paper focuses on the interaction between software prefetching (both binding and nonbinding prefetch) and software pipelining for statically-scheduled machines.  ...  It is also shown that the penalty of the stalls is in general higher than the effect of spill code.  ...  Acknowledgments This work has been supported by the Spanish Ministry of Education under contract CICYT-TIC 511/98, the ESPRIT Project MHAOTEU (EP24942) and by the Catalan CIRIT under grant 1996FI-3083-  ... 
doi:10.1006/jpdc.1999.1553 fatcat:472wggwkknantjizdyjkju7m5a

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Kim-Anh Tran, Alexandra Jimborean, Trevor E. Carlson, Konstantinos Koukos, Magnus Själander, Stefanos Kaxiras
2018 Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2018  
In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design.  ...  We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching  ...  Acknowledgements This work is supported, in part, by the Swedish Research Council UPMARC Linnaeus Centre and by the Swedish VR (grant no. 2016-05086).  ... 
doi:10.1145/3192366.3192393 dblp:conf/pldi/TranJCKSK18 fatcat:jsvxxfqkzvgrnnrtl7spfd4y6a

Static Instruction Scheduling for High Performance on Limited Hardware

Kim-Anh Tran, Trevor E. Carlson, Konstantinos Koukos, Magnus Själander, Vasileios Spiliopoulos, Stefanos Kaxiras, Alexandra Jimborean
2018 IEEE transactions on computers  
To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure.  ...  Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption.  ...  ACKNOWLEDGMENTS This work is supported, in part, by the Swedish Research Council UPMARC Linnaeus Centre and by the Swedish VR (grant no. 2016-05086).  ... 
doi:10.1109/tc.2017.2769641 fatcat:65lnszfaonatxdmo3ksxpsweau

Improving data cache performance by pre-executing instructions under a cache miss

James Dundas, Trevor Mudge
1997 Proceedings of the 11th international conference on Supercomputing - ICS '97  
The principal hardware cost is an extra register file. To measure the impact of runahead, we simulated a processor executing five integer Spec95 benchmarks.  ...  Our results show that runahead was able to significantly reduce data cache CPI for four of the five benchmarks.  ...  Confining prefetching to software approaches means that the hardware can be kept simple and fast, but prefetch instructions may cause code bloat, and increase register pressure.  ... 
doi:10.1145/263580.263597 dblp:conf/ics/DundasM97 fatcat:4aqqgmyazrfmte53haa6coihrm

On Instruction-Level Method for Reducing Cache Penalties in Embedded VLIW Processors

Samir Ammenouche, Sid-Ahmed-Ali Touati, William Jalby
2009 2009 11th IEEE International Conference on High Performance Computing and Communications  
Our method is based on a robust combination of memory pre-loading with data prefetching, allowing us to optimise both regular and irregular applications at the assembly level.  ...  Second, the strides of memory accesses do not appear to be constant at source code level, because of indirect accesses. Hence, usual prefetching techniques are not applicable.  ...  Acknowledgements This research result has been supported by the ANR MOPUCE project (number 05-JCJC-0039) and the French Ministry of Industry.  ... 
doi:10.1109/hpcc.2009.32 dblp:conf/hpcc/AmmenoucheTJ09 fatcat:5swqekbrajdoffnny5fb75anke

Improving data cache performance by pre-executing instructions under a cache miss

James Dundas, Trevor Mudge
2014 25th Anniversary International Conference on Supercomputing Anniversary Volume
The principal hardware cost is an extra register file. To measure the impact of runahead, we simulated a processor executing five integer Spec95 benchmarks.  ...  Our results show that runahead was able to significantly reduce data cache CPI for four of the five benchmarks.  ...  Confining prefetching to software approaches means that the hardware can be kept simple and fast, but prefetch instructions may cause code bloat, and increase register pressure.  ... 
doi:10.1145/2591635.2667173 fatcat:gujzigi23vegvbjvfdmcatgdua

Two-level hierarchical register file organization for VLIW processors

Javier Zalamea, Josep Llosa, Eduard Ayguadé, Mateo Valero
2000 Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture - MICRO 33  
This degradation could be avoided if a high-capacity register file were included without causing a negative impact on the cycle time of the processor.  ...  If more registers than those available in the architecture are required, some actions (such as spill code insertion) have to be applied to reduce this pressure, at the expense of some performance degradation  ...  The higher capacity reduces spill code and allows the application of aggressive software prefetching techniques.  ... 
doi:10.1145/360128.360143 fatcat:ezkz65alirch5bqhneiiosqmze

Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs [article]

O. Kaczmarek, C. Schmidt, P. Steinbrecher, M. Wagner
2014 arXiv   pre-print
By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance greater than 300 GFlop/s on both architectures.  ...  This more than doubles the performance of the inversions.  ...  We acknowledge support from NVIDIA® through the CUDA Research Center program.  ... 
arXiv:1411.4439v1 fatcat:ucptsvldcraqbdf7ttbic74bh4

Code generation for hardware accelerated AES

Raymond Manley, Paul Magrath, David Gregg
2010 ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors  
We apply both common loop optimizations and ones specific to AES. We evaluate the generated code on hardware with built-in AES support using both selective-brute force and guided searches.  ...  The AES algorithm consists of several 'rounds' of encryption, each of which involves a relatively complicated computation.  ...  The combination of these smaller optimizations with the larger impact ones like interleaving, software pipelining, and keys in registers give us a significant improvement over our baselines. B.  ... 
doi:10.1109/asap.2010.5540955 dblp:conf/asap/ManleyMG10 fatcat:mhjubdqev5fnxgfuid2zk4l2xe

A Study of the Performance Potential for Dynamic Instruction Hints Selection [chapter]

Rao Fu, Jiwei Lu, Antonia Zhai, Wei-Chung Hsu
2006 Lecture Notes in Computer Science  
This paper discusses different instruction hints available on modern processor architectures and shows the potential performance impact on many benchmark programs.  ...  They can be generated by the compiler and the post-link optimizer to reduce cache misses, improve branch prediction and minimize other performance bottlenecks.  ...  The authors want to thank Abhinav Das and Jinpyo Kim for their suggestions and help. We also thank all of the anonymous reviewers for their valuable comments.  ... 
doi:10.1007/11859802_7 fatcat:tkw4ji4j5zca3j2otayn4ueugm

Integrating High-Level Optimizations in a Production Compiler: Design and Implementation Experience [chapter]

Somnath Ghosh, Abhay Kanhere, Rakesh Krishnaiyer, Dattatraya Kulkarni, Wei Li, Chu-Cheow Lim, John Ng
2003 Lecture Notes in Computer Science  
In particular, we describe decisions made in the design of HLO targeting Itanium processor family. We provide empirical data to validate the design decisions.  ...  The High-Level Optimizer (HLO) is a key part of the compiler technology that enabled Itanium™ and Itanium™ 2 processors deliver leading floating-point performance at their introduction.  ...  Also when prefetch relies on register rotation, the address copies are specially marked (shown as MCOPY in Fig. 5) for the software pipeliner.  ... 
doi:10.1007/3-540-36579-6_22 fatcat:4g726b35sbbbpd4jsirieuzfny

Do Trace Cache, Value Prediction and Prefetching Improve SMT Throughput? [chapter]

Chen-Yong Cher, Il Park, T. N. VijayKumar
2006 Lecture Notes in Computer Science  
SMT's sharing of the instruction storage (i.e., trace cache or i-cache), physical registers, and issue queue impacts the effectiveness of trace cache, value prediction, and prefetching, respectively.  ...  Our key contributions are: (1) we identify a fundamental interaction between the techniques and SMT's sharing of resources among multiple threads, and (2) we quantify the impact of this interaction on  ...  Prefetching While prefetching can be implemented in either software [24, 14] or hardware, we focus on hardware prefetching in this study. Chen et al.  ... 
doi:10.1007/11682127_17 fatcat:xsb65e4pcnb37kh3jye2xrjrza

Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures

Ioan Hadade, Feng Wang, Mauro Carnevale, Luca di Mare
2018 Computer Physics Communications  
residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning and multithreading for exploiting  ...  We provide implementations for a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures.  ...  The authors are particularly indebted to Timothy Jones at the University of Cambridge for discussions and help with software prefetching, David Power and Konstantinos Mouzakitis at Boston Limited for access  ... 
doi:10.1016/j.cpc.2018.07.001 fatcat:udpf725opbb5fhoygfwhkh3g7i

Compositional approach applied to loop specialization

L. Djoudi, J.-T. Acquaviva, D. Barthou
2009 Concurrency and Computation  
Then we demonstrate the benefit of our method on kernels optimized with software pipeline, with detailed experimental results. These experiments were conducted in a semi-automated manner.  ...  Hence, the resulting code achieves the same level of performance as each version on its specific iteration interval.  ...  This does not lead to excessive register pressure. In fact, the global register pressure depends on the number of iterations simultaneously alive.  ... 
doi:10.1002/cpe.1337 fatcat:g2r7h2jsanbytkom7wjpv5ezzu

Implementing virtual memory in a vector processor with software restart markers

Mark Hampton, Krste Asanović
2006 Proceedings of the 20th annual international conference on Supercomputing - ICS '06  
In this paper, we propose a new exception handling model for vector architectures based on software restart markers, which divide the program into idempotent regions of code.  ...  Our scheme also removes the requirement of preserving vector register file contents in the event of a context switch.  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their comments. This work was partly funded by NSF CAREER award CCR-0093354, DARPA PAC/C award F30602-00-2-0562, and the Cambridge-MIT Institute.  ... 
doi:10.1145/1183401.1183422 dblp:conf/ics/HamptonA06 fatcat:l32k6jssnbhmfknx7lt56b3urq