Software Data Prefetching for Software Pipelined Loops
1999
Journal of Parallel and Distributed Computing
This paper focuses on the interaction between software prefetching (both binding and nonbinding prefetch) and software pipelining for statically-scheduled machines. ...
It is also shown that the penalty of the stalls is in general higher than the effect of spill code. ...
Acknowledgments This work has been supported by the Spanish Ministry of Education under contract CICYT-TIC 511/98, the ESPRIT Project MHAOTEU (EP24942) and by the Catalan CIRIT under grant 1996FI-3083- ...
doi:10.1006/jpdc.1999.1553
fatcat:472wggwkknantjizdyjkju7m5a
SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
2018
Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2018
In this paper, we address one of the main performance bottlenecks (last-level cache misses) through a software-hardware co-design. ...
We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching ...
Acknowledgements This work is supported, in part, by the Swedish Research Council UPMARC Linnaeus Centre and by the Swedish VR (grant no. 2016-05086). ...
doi:10.1145/3192366.3192393
dblp:conf/pldi/TranJCKSK18
fatcat:jsvxxfqkzvgrnnrtl7spfd4y6a
Static Instruction Scheduling for High Performance on Limited Hardware
2018
IEEE Transactions on Computers
To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. ...
Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. ...
ACKNOWLEDGMENTS This work is supported, in part, by the Swedish Research Council UPMARC Linnaeus Centre and by the Swedish VR (grant no. 2016-05086). ...
doi:10.1109/tc.2017.2769641
fatcat:65lnszfaonatxdmo3ksxpsweau
Improving data cache performance by pre-executing instructions under a cache miss
1997
Proceedings of the 11th international conference on Supercomputing - ICS '97
The principal hardware cost is an extra register file. To measure the impact of runahead, we simulated a processor executing five integer Spec95 benchmarks. ...
Our results show that runahead was able to significantly reduce data cache CPI for four of the five benchmarks. ...
Confining prefetching to software approaches means that the hardware can be kept simple and fast, but prefetch instructions may cause code bloat, and increase register pressure. ...
doi:10.1145/263580.263597
dblp:conf/ics/DundasM97
fatcat:4aqqgmyazrfmte53haa6coihrm
On Instruction-Level Method for Reducing Cache Penalties in Embedded VLIW Processors
2009
2009 11th IEEE International Conference on High Performance Computing and Communications
Our method is based on a robust combination of memory pre-loading with data prefetching, allowing us to optimise both regular and irregular applications at the assembly level. ...
Second, the strides of memory accesses do not appear to be constant at source code level, because of indirect accesses. Hence, usual prefetching techniques are not applicable. ...
Acknowledgements This research result has been supported by the ANR MOPUCE project (number 05-JCJC-0039) and the French Ministry of Industry. ...
doi:10.1109/hpcc.2009.32
dblp:conf/hpcc/AmmenoucheTJ09
fatcat:5swqekbrajdoffnny5fb75anke
Improving data cache performance by pre-executing instructions under a cache miss
2014
25th Anniversary International Conference on Supercomputing Anniversary Volume
The principal hardware cost is an extra register file. To measure the impact of runahead, we simulated a processor executing five integer Spec95 benchmarks. ...
Our results show that runahead was able to significantly reduce data cache CPI for four of the five benchmarks. ...
Confining prefetching to software approaches means that the hardware can be kept simple and fast, but prefetch instructions may cause code bloat, and increase register pressure. ...
doi:10.1145/2591635.2667173
fatcat:gujzigi23vegvbjvfdmcatgdua
Two-level hierarchical register file organization for VLIW processors
2000
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture - MICRO 33
This degradation could be avoided if a high-capacity register file were included without causing a negative impact on the cycle time of the processor. ...
If more registers than those available in the architecture are required, some actions (such as spill code insertion) have to be applied to reduce this pressure, at the expense of some performance degradation ...
The higher capacity reduces spill code and allows the application of aggressive software prefetching techniques. ...
doi:10.1145/360128.360143
fatcat:ezkz65alirch5bqhneiiosqmze
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
[article]
2014
arXiv
pre-print
By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance greater than 300 GFlop/s on both architectures. ...
This more than doubles the performance of the inversions. ...
We acknowledge support from NVIDIA® through the CUDA Research Center program. ...
arXiv:1411.4439v1
fatcat:ucptsvldcraqbdf7ttbic74bh4
Code generation for hardware accelerated AES
2010
ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors
We apply both common loop optimizations and ones specific to AES. We evaluate the generated code on hardware with built-in AES support using both selective-brute force and guided searches. ...
The AES algorithm consists of several 'rounds' of encryption, each of which involves a relatively complicated computation. ...
The combination of these smaller optimizations with the larger-impact ones like interleaving, software pipelining, and keys in registers gives us a significant improvement over our baselines.
doi:10.1109/asap.2010.5540955
dblp:conf/asap/ManleyMG10
fatcat:mhjubdqev5fnxgfuid2zk4l2xe
A Study of the Performance Potential for Dynamic Instruction Hints Selection
[chapter]
2006
Lecture Notes in Computer Science
This paper discusses different instruction hints available on modern processor architectures and shows the potential performance impact on many benchmark programs. ...
They can be generated by the compiler and the post-link optimizer to reduce cache misses, improve branch prediction and minimize other performance bottlenecks. ...
The authors want to thank Abhinav Das and Jinpyo Kim for their suggestions and help. We also thank all of the anonymous reviewers for their valuable comments. ...
doi:10.1007/11859802_7
fatcat:tkw4ji4j5zca3j2otayn4ueugm
Integrating High-Level Optimizations in a Production Compiler: Design and Implementation Experience
[chapter]
2003
Lecture Notes in Computer Science
In particular, we describe decisions made in the design of HLO targeting Itanium processor family. We provide empirical data to validate the design decisions. ...
The High-Level Optimizer (HLO) is a key part of the compiler technology that enabled Itanium™ and Itanium™ 2 processors to deliver leading floating-point performance at their introduction. ...
Also when prefetch relies on register rotation, the address copies are specially marked (shown as MCOPY in Fig. 5) for the software pipeliner. ...
doi:10.1007/3-540-36579-6_22
fatcat:4g726b35sbbbpd4jsirieuzfny
Do Trace Cache, Value Prediction and Prefetching Improve SMT Throughput?
[chapter]
2006
Lecture Notes in Computer Science
SMT's sharing of the instruction storage (i.e., trace cache or i-cache), physical registers, and issue queue impacts the effectiveness of trace cache, value prediction, and prefetching, respectively. ...
Our key contributions are: (1) we identify a fundamental interaction between the techniques and SMT's sharing of resources among multiple threads, and (2) we quantify the impact of this interaction on ...
Prefetching While prefetching can be implemented in either software [24, 14] or hardware, we focus on hardware prefetching in this study. Chen et al. ...
doi:10.1007/11682127_17
fatcat:xsb65e4pcnb37kh3jye2xrjrza
Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures
2018
Computer Physics Communications
residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning and multithreading for exploiting ...
We provide implementations for a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures. ...
The authors are particularly indebted to Timothy Jones at the University of Cambridge for discussions and help with software prefetching, David Power and Konstantinos Mouzakitis at Boston Limited for access ...
doi:10.1016/j.cpc.2018.07.001
fatcat:udpf725opbb5fhoygfwhkh3g7i
Compositional approach applied to loop specialization
2009
Concurrency and Computation
Then we demonstrate the benefit of our method on kernels optimized with software pipeline, with detailed experimental results. These experiments were conducted in a semi-automated manner. ...
Hence, the resulting code achieves the same level of performance as each version on its specific iteration interval. ...
This does not lead to excessive register pressure. In fact, the global register pressure depends on the number of iterations simultaneously alive. ...
doi:10.1002/cpe.1337
fatcat:g2r7h2jsanbytkom7wjpv5ezzu
Implementing virtual memory in a vector processor with software restart markers
2006
Proceedings of the 20th annual international conference on Supercomputing - ICS '06
In this paper, we propose a new exception handling model for vector architectures based on software restart markers, which divide the program into idempotent regions of code. ...
Our scheme also removes the requirement of preserving vector register file contents in the event of a context switch. ...
ACKNOWLEDGMENTS We thank the anonymous reviewers for their comments. This work was partly funded by NSF CAREER award CCR-0093354, DARPA PAC/C award F30602-00-2-0562, and the Cambridge-MIT Institute. ...
doi:10.1145/1183401.1183422
dblp:conf/ics/HamptonA06
fatcat:l32k6jssnbhmfknx7lt56b3urq