Interaction between the phases of register allocation and instruction scheduling is often considered in publications devoted to optimizations for the final stage of compilation. ... However, their integration can essentially reduce the time of operation and enhance the performance of the resulting code. ... and the spill/load instructions. ...doi:10.1134/s0361768810060058 fatcat:qldzypnqozf4dhrgnkkp3du7aq
binary trees with spills and pipelined loads. ... at a time, under the restrictions that the dependence graph is a full binary tree, all arithmetic and store operations have unit latency, and all load operations have a latency of | or all load operations ...
This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. ... Our compiler uses information about the program structure and estimates of kernel and memory operation execution times to overlap kernel execution with memory transfers, maximizing performance, and to ... Since the memory load mem could execute in parallel with the kernel op1, buffers for c and d have been extended with shadows. ...doi:10.1145/1152154.1152164 dblp:conf/IEEEpact/DasDM06 fatcat:6zzxk5sffbdprhtoa3iw6vfnjq
The Active Memory Cube (AMC) is a novel near-memory processor that exploits high memory bandwidth and low latency close to DRAM to execute scientific applications in an energy-efficient manner. ... with the architecture. ... This diminished the possibility of scheduling binary arithmetic and memory instructions in parallel. ...doi:10.1109/sbac-pad.2015.18 dblp:conf/sbac-pad/JacobNCSKBAO15 fatcat:2tsunsbztzbdhjuxn5zobup3wi
Acknowledgments We thank Jim Kahle, Ted Maeurer, Jaime Moreno, and Alexandre Eichenberger for their many comments and suggestions in the preparation of this work. ... We also thank Valentina Salapura for her help and numerous suggestions in the preparation of this article. ... Figure 4b shows that SIMD data-parallel operations cannot readily be used for operations on scalar elements with arbitrary alignment loaded into a vector register using the quadword load operations. ...doi:10.1109/mm.2006.41 fatcat:tt5nh6bppzdnxh6rhwfdcq7gle
In this paper, we present an effective code generation algorithm named Rotation Scheduling with Spill Codes Predicting (RSSP) to maximally exploit the benefits of non-orthogonal architectures. ... Furthermore, we also present some preliminary ideas to generalize RSSP, which can make it more practicable and suit various DSPs with similar architectural features. ... Then, in the code compaction phase, these spill codes can be scheduled in parallel with other operations, this can decrease the spill costs. ...doi:10.1007/s11265-007-0053-x fatcat:gorrmf5ue5fpblaub2vdi2rb7i
The VLIW architecture can be exploited to greatly enhance instruction level parallelism, thus it can provide computation power and energy efficiency advantages, which satisfies the requirements of future ... We have implemented several advanced optimization techniques in the compiler, and fulfilled the O3 level optimization. Benchmarks from the DSPstone test suite are used to verify the compiler. ... Also, the Magnolia compiler supports several addressing modes for load and store operations, which is quite useful in the DSP domain. ...doi:10.3390/s120404466 pmid:22666040 pmcid:PMC3355421 fatcat:isbbzi3osfcgfo3cyypsmmvhkm
Traditionally, software developers have programmed DSPs in assembly language for efficiency. This implies time-consuming programming, extensive debugging, and little or no code portability. ... These constraints, along with a relatively narrow application domain, have led designers to create special architectural features, as found in the Harvard architecture, VLIW (very long instruction word ... Acknowledgments This work is supported by Infineon Technologies Austria, the European Commission under the project SoCMobinet (IST-2000-30094) and the Christian Doppler Gesellschaft. ...doi:10.1109/mm.2004.40 fatcat:lclck3wx2zd4zkilqn2oanqpdq
Proceedings of the 30th ACM SIGPLAN International Conference on Compiler Construction
In particular, novel parallelization algorithms and intermediate representations are required. ... volume, parallelization degree). ... The tool allows for custom arithmetic operator generation based on target, required frequency and arithmetic precision. ...doi:10.1145/3446804.3446847 fatcat:pyhil53nuzg2hk2dc7pbj7zh6q
With advances in VLSI technology, microprocessor designers can provide more microarchitectural parallelism to increase performance. ... The experiments reported in this paper address two important issues: the effects of these forms and the appropriate balance among them. ... Acknowledgements The authors would like to acknowledge Nancy Warter, Sharon Simonson, Sadun Anik, and the other members of the Computer System Group for their invaluable comments and suggestions. ...doi:10.1145/633625.52406 fatcat:tlg7u7micvgohpktjmkkwpqk4y
90m:68019 68N20 68Q25 Bernstein, David [Bernstein, David Josef] (1-IBM); Jaffe, Jeffrey M. (1-IBM); Rodeh, Michael (IL-IBM) Scheduling arithmetic and load operations in parallel with no spilling ... Comput. 18 (1989), no. 6, 1098-1127. Summary: “A machine model in which load operations can be performed in parallel with arithmetic operations by two separate functional units is considered. ...
A companion paper in this issue presents a survey of application and architecture trends for embedded systems in these growth markets. ... The increasing use of embedded software, often implemented on a core processor in a single-chip system, is a clear trend in the telecommunications, multimedia, and consumer electronics industries. ... Communication between memories and registers requires separate "load" and "store" operations, which may be scheduled in parallel with arithmetic operations if permitted by the instruction set. ...doi:10.1109/5.558718 fatcat:jtn2aeo4ybcwfgc67rdsdqjhei
To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. ... Also, we confirm that multiply-add fused units will have a significant impact in raising the performance of future processors architectures with a reasonable increase in cost. ... ACKNOWLEDGMENTS This work has been supported by the Ministry of Culture and Education of Spain under contract TIC 98-0511 and by CEPBA (European Centre for Parallelism of Barcelona). ...doi:10.1109/12.956090 fatcat:mnknjyvb3fexbir3rx3xrqzc7q
Traditional RISC architectures use hardware approaches to obtain more instruction-level parallelism, with the compiler and the operating system (OS) having only indirect visibility into the mechanisms ... The IA-64 architecture was specifically designed to enable systems which create and exploit high levels of instruction-level parallelism by explicitly encoding a program's parallelism in the instruction ... , Shashikant Rao, and Carol Thompson. ...doi:10.1145/384264.379242 fatcat:iprphesjo5bfxpp2zmvkfyjkfq
Proceedings of the ninth international conference on Architectural support for programming languages and operating systems - ASPLOS-IX
Traditional RISC architectures use hardware approaches to obtain more instruction-level parallelism, with the compiler and the operating system (OS) having only indirect visibility into the mechanisms ... The IA-64 architecture was specifically designed to enable systems which create and exploit high levels of instruction-level parallelism by explicitly encoding a program's parallelism in the instruction ... , Shashikant Rao, and Carol Thompson. ...doi:10.1145/378993.379242 fatcat:pli4llzbivbk3h4qy2h676a74m