51,056 Hits in 7.4 sec

Post-pass binary adaptation for software-based speculative precomputation

Steve S.W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen
2002 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02  
Speculative threads can be spawned by one of two events: a basic trigger, which occurs when a designated trigger instruction in the non-speculative thread is retired, or a chaining trigger, by which one  ...  Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an  ...  ACKNOWLEDGEMENTS We thank Tor Aamodt, Murali Annavaram, Jesse Fang, Monica Lam, Yong-fong Lee, Ken Lueh, Justin Rattner, and Xinmin Tian for their valuable comments on this paper.  ... 
doi:10.1145/512529.512544 dblp:conf/pldi/LiaoWWSHL02 fatcat:hidw6rtajfd5vffdg2havbqi4e
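The trigger mechanism described in the excerpt above can be illustrated with a minimal sketch (all names are hypothetical; this is not the paper's actual implementation): a basic trigger fires when a designated instruction retires in the non-speculative thread, while a chaining trigger lets one speculative helper spawn the next.

```python
# Minimal sketch of basic vs. chaining triggers for software-based
# speculative precomputation. Hypothetical names for illustration only.

def run_main_thread(instruction_stream, basic_triggers, prefetched):
    """Retire instructions in order; a basic trigger spawns a helper
    when its designated trigger instruction retires."""
    for pc in instruction_stream:
        if pc in basic_triggers:
            basic_triggers[pc](prefetched)

def make_chained_helper(addresses, next_helper=None):
    """Build a speculative helper that prefetches some addresses and,
    via a chaining trigger, spawns the next helper in the chain."""
    def helper(prefetched):
        prefetched.update(addresses)   # simulate prefetching into cache
        if next_helper is not None:    # chaining trigger fires here
            next_helper(prefetched)
    return helper

# A two-deep chain walking a linked structure: helper2 is spawned by
# helper1's chaining trigger, not by the main thread.
helper2 = make_chained_helper({0x2000, 0x2008})
helper1 = make_chained_helper({0x1000, 0x1008}, next_helper=helper2)

prefetched = set()
run_main_thread([10, 20, 30], basic_triggers={20: helper1},
                prefetched=prefetched)
print(sorted(hex(a) for a in prefetched))
# → ['0x1000', '0x1008', '0x2000', '0x2008']
```

The point of the chain is that the main thread pays for only one trigger check; subsequent helpers keep running ahead of it on their own.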

Global instruction scheduling for superscalar machines

David Bernstein, Michael Rodeh
1991 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation - PLDI '91  
A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph, to move instructions well beyond basic block boundaries  ...  This novel scheduling framework is based on the parametric description of the machine architecture, which spans a range of superscalars and VLIW machines, and exploits speculative execution of instructions  ...  (as well as enhanced percolation scheduling) does not depend on such assumption.  ... 
doi:10.1145/113445.113466 dblp:conf/pldi/BernsteinR91 fatcat:obxxbwsovncyjcg7zx2bgaz6nu
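The kind of dependence-driven motion beyond basic block boundaries that this excerpt describes can be sketched with a toy list scheduler (a hypothetical illustration, not Bernstein and Rodeh's algorithm): an instruction from a later block is hoisted into an earlier issue slot whenever its dependences allow.

```python
# Toy list scheduler over a dependence graph for a 2-issue machine,
# illustrating how dependence information lets an instruction from a
# later basic block be hoisted ("speculated") upward. Hypothetical IR.

def list_schedule(instrs, deps, issue_width=2):
    """instrs: instruction names in program order; deps: name -> set
    of predecessors that must issue in a strictly earlier cycle."""
    done, cycles = set(), []
    remaining = list(instrs)
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        if not ready:
            raise ValueError("cycle in dependence graph")
        slot = ready[:issue_width]   # fill this cycle's issue slots
        cycles.append(slot)
        done |= set(slot)
        remaining = [i for i in remaining if i not in slot]
    return cycles

# Block 1: a, b ; Block 2: c, d.  c has no dependence on block 1,
# so the scheduler hoists it alongside a.
deps = {"b": {"a"}, "d": {"b", "c"}}
print(list_schedule(["a", "b", "c", "d"], deps))
# → [['a', 'c'], ['b'], ['d']]
```

A purely block-local scheduler would need four cycles here; crossing the block boundary saves one.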

Chip multi-processor scalability for single-threaded applications

Neil Vachharajani, Matthew Iyer, Chinmay Ashok, Manish Vachharajani, David I. August, Daniel Connors
2005 SIGARCH Computer Architecture News  
Using the results from this analysis, the paper forecasts that CMPs, using the "intrinsic" parallelism in a program, can sustain the performance improvement users have come to expect from new processors  ...  This paper examines the scalability potential for exploiting the parallelism in single-threaded applications on these CMP platforms.  ...  If control dependences are ignored, programs such as 176.gcc, 181.mcf, and 197.parser achieve IPCs in excess of one hundred and in some cases in excess of one thousand.  ... 
doi:10.1145/1105734.1105741 fatcat:3axmkqu2avfphhypkrmj5f5hwi

Instruction scheduling over regions: A framework for scheduling across basic blocks [chapter]

Uma Mahadevan, Sridhar Ramakrishnan
1994 Lecture Notes in Computer Science  
within a region.  ...  Within each basic block, a directed acyclic graph represents the dependence information, while a definition matrix in conjunction with a path matrix represents the overall control and data dependence information  ...  Legality Legality of movement refers to the preservation of correct program behavior by ensuring that all data and control dependencies are honored by the transformations on the program graph associated  ... 
doi:10.1007/3-540-57877-3_28 fatcat:tsynmpalmjalrhnwxudtl2casm

Graph partitioning applied to DAG scheduling to reduce NUMA effects

Isaac Sánchez Barrera, Eduard Ayguadé, Marc Casas, Jesús Labarta, Miquel Moretó, Mateo Valero
2018 Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '18  
We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph.  ...  Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12× on average with respect to the state-of-the-art.  ...  and enhanced workpushing, where tasks are scheduled in the NUMA region containing most of their data dependencies.  ... 
doi:10.1145/3178487.3178535 dblp:conf/ppopp/BarreraCMALV18 fatcat:ir7l2elumrgmzem2n2eld6qa5m
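The core idea in this entry, partitioning the task dependency graph so that tasks linked by data dependencies land on the same NUMA node, can be sketched with an exhaustive min-cut bisection (illustration only; real runtimes use heuristic partitioners, and the task set here is invented).

```python
# Sketch of NUMA-aware DAG partitioning: split tasks into two equal
# halves (one per NUMA node) minimizing dependency edges that cross
# between halves. Exhaustive search, for illustration only.
from itertools import combinations

def min_cut_bisection(tasks, edges):
    """Return ((half_a, half_b), cut) minimizing cross-half edges."""
    best, best_cut = None, None
    for half in combinations(tasks, len(tasks) // 2):
        a = set(half)
        cut = sum(1 for u, v in edges if (u in a) != (v in a))
        if best_cut is None or cut < best_cut:
            best, best_cut = (a, set(tasks) - a), cut
    return best, best_cut

# Two chains of tasks that touch each other only once.
tasks = ["t0", "t1", "t2", "t3"]
edges = [("t0", "t1"), ("t2", "t3"), ("t1", "t2")]
(part_a, part_b), cut = min_cut_bisection(tasks, edges)
print(sorted(part_a), sorted(part_b), cut)
# → ['t0', 't1'] ['t2', 't3'] 1
```

Keeping each chain on one node means only the single t1→t2 dependency incurs remote-memory traffic.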

Dynamic Memory Management in the Loci Framework [chapter]

Yang Zhang, Edward A. Luke
2005 Lecture Notes in Computer Science  
This paper presents a dynamic memory management scheme for a declarative high-performance data-parallel programming system -the Loci framework.  ...  Resource management is a critical concern in high-performance computing software.  ...  Region inference [8] is a relatively new form of automatic memory management. It relies on static program analysis and is a compile-time method and uses the region concept.  ... 
doi:10.1007/11428848_101 fatcat:letzi2iw7jb2remomk6dbcjsui

A domain-specific high-level programming model

Farouk Mansouri, Sylvain Huet, Dominique Houzet
2015 Concurrency and Computation  
However, it is necessary for the user to manually tune the data allocation and transfer, and the scheduling of parallel regions on GPU architecture.  ...  In addition, to obtain good performance, users have to focus on allocation and communication of data, and scheduling of parallel region onto threads.  ... 
doi:10.1002/cpe.3622 fatcat:5t6nnj62rfbb3dvpeyi7tmvvcy

Compilers for instruction-level parallelism

M. Schlansker, T.M. Conte, J. Dehnert, K. Ebcioglu, J.Z. Fang, C.L. Thompson
1997 Computer  
ILP in a software-centric approach employs a very long instruction word (VLIW) processor and relies on a compiler to statically parallelize and schedule code.  ...  I nstruction-level parallelism allows a sequence of instructions derived from a sequential program to be parallelized for execution on multiple pipelined functional units.  ...  Schedulers typically operate either on entire procedures or on program regions excerpted from a procedure.  ... 
doi:10.1109/2.642817 fatcat:sqa3irdg3zcqzftmok3rpsv65a

Global Multi-Threaded Instruction Scheduling

Guilherme Ottoni, David August
2007 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)  
This section first provides some background on Program Dependence Graphs (PDG), and then demonstrates how PDGs enable global MT instruction scheduling.  ...  Program Dependences and Multi-Threaded Instruction Scheduling Program dependences constitute an important abstraction in compiler optimization and parallelization.  ...  The authors acknowledge the support of the GSRC Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.  ... 
doi:10.1109/micro.2007.32 dblp:conf/micro/OttoniA07 fatcat:lkiem6welja2noa6d3vvwwpohu

VLIW compilation techniques in a superscalar environment

Kemal Ebcioglu, Randy D. Groves, Ki-Chang Kim, Gabriel M. Silberman, Isaac Ziv
1994 SIGPLAN notices  
We describe techniques for converting the intermediate code representation of a given program, as generated by a profiling directed feedback, including scheduling heuristics, code reordering and branch  ...  A PDG for a procedure has two parts, a control dependence graph and a data dependence graph.  ...  Also, Unrolling, Renaming, Global Scheduling, Software Pipelining The regions of the program are compacted through the combination of global scheduling [10] and enhanced pipeline scheduling  ... 
doi:10.1145/773473.178247 fatcat:fcg6iov7pfgkpe2me44dhng274

High-Performance and Time-Predictable Embedded Computing [chapter]

Luis Miguel Pinho, Eduardo Quinones, Marko Bertogna, Andrea Marongiu, Vincent Nélis, Paolo Gai, Juan Sancho
2018 High-Performance and Time-Predictable Embedded Computing  
In a context where hardware vendors used to implement their own proprietary versions of threads, Pthreads arose with the aim of enhancing the portability of threaded applications that reside on shared  ...  The language specifies a programming language based on C99 used to control the host, and a standard interface for parallel computing, which exploits task-based and data-based parallelism, used to control  ...  This assumes the execution of a single DAG program, where a node cannot be interrupted to execute other nodes of the same graph.  ... 
doi:10.13052/rp-9788793609624 fatcat:k7s3qrikfrclffppj2n2qjk4ny

HTS: A Hardware Task Scheduler for Heterogeneous Systems [article]

Kartik Hegde, Abhishek Srivastava, Rohit Agrawal
2019 arXiv   pre-print
Compared to executing the benchmark on a system with sequential scheduling, proposed scheduler achieves up to 12x improvement in performance.  ...  A massively heterogeneous system with a large number of hardware accelerators along with multiple general purpose CPUs is a promising direction, but pose several challenges in terms of the run-time scheduling  ...  A program instance is defined in terms of its logical regions, which express locality and independence of program data, and tasks.  ... 
arXiv:1907.00271v1 fatcat:5wsgr5yzcfft7lkqz4m555kqym

OpenMP to CUDA graphs

Chenle Yu, Sara Royuela, Eduardo Quiñones
2020 Proceedings of the 23rd International Workshop on Software and Compilers for Embedded Systems  
This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the benefits of programmability of a high-level programming model such  ...  as OpenMP, with the performance benefits of a low-level programming model such as CUDA.  ...  ., all code is taskified within a given region, then, the TDG of that region represents an execution flow that can be perfectly mapped to a CUDA graph.  ... 
doi:10.1145/3378678.3391881 dblp:conf/scopes/YuRQ20 fatcat:ryye746lwvg2bhol3w7qef5ugy
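The task dependency graph (TDG) to execution-flow mapping mentioned in this excerpt can be sketched as follows (a hypothetical task set, not the paper's transformation): edges are derived from producer-to-consumer data clauses, and a topological order gives the launch order a CUDA graph would encode.

```python
# Sketch of deriving a task dependency graph from tasks' in/out data
# sets and ordering it, as a stand-in for the OpenMP-to-CUDA-graph
# mapping. Task names and data sets are invented for illustration.
from graphlib import TopologicalSorter

tasks = {
    "init":   {"in": set(),      "out": {"A"}},
    "scale":  {"in": {"A"},      "out": {"B"}},
    "offset": {"in": {"A"},      "out": {"C"}},
    "reduce": {"in": {"B", "C"}, "out": {"R"}},
}

# A task depends on every task that produces one of its inputs.
deps = {
    name: {prod for prod, pspec in tasks.items()
           if pspec["out"] & spec["in"]}
    for name, spec in tasks.items()
}
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Here "scale" and "offset" have no edge between them, so a CUDA graph built from this TDG could run them concurrently.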
Showing results 1 — 15 out of 51,056 results