Global Multi-Threaded Instruction Scheduling

Guilherme Ottoni, David August
2007 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)  
Program Dependences and Multi-Threaded Instruction Scheduling Program dependences constitute an important abstraction in compiler optimization and parallelization.  ...  In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework.  ...  Acknowledgments We thank the entire Liberty Research Group and Vivek Sarkar for their feedback during this work. Additionally, we thank the anonymous reviewers for their insightful comments.  ... 
doi:10.1109/micro.2007.32 dblp:conf/micro/OttoniA07 fatcat:lkiem6welja2noa6d3vvwwpohu

A Scalable, Multi-thread, Multi-issue Array Processor Architecture for DSP Applications Based on Extended Tomasulo Scheme [chapter]

Mladen Bereković, Tim Niggemeier
2006 Lecture Notes in Computer Science  
threads (SMT).  ...  thread processing units, ALUs, registers files and memories are distributed across the chip and communicate with each other by special networks, forming a "network-on-a-chip" (NOC) [1].  ...  The benchmarks were hand-optimized in Assembler for the architecture without compiler-support and exploit multi-threading.  ... 
doi:10.1007/11796435_30 fatcat:c6tgfumaubfjhk75o734u775he

Distributed order scheduling and its application to multi-core dram controllers

Thomas Moscibroda, Onur Mutlu
2008 Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing - PODC '08  
We show that without communication, the average completion time of all orders can be by a factor Ω(√n) worse than in the optimal schedule.  ...  Specifically, we devise a distributed scheduling algorithm that, for any k, achieves an approximation ratio of O(k) in n/k communication rounds.  ...  [Figure: instruction traces for Applications 0-3; L1 and L2 caches]  ... 
doi:10.1145/1400751.1400799 dblp:conf/podc/MoscibrodaM08 fatcat:alzowvidfnaytoibhw5g6ibsje

Scheduling analysis from architectural models of embedded multi-processor systems

Stéphane Rubini, Christian Fotsing, Frank Singhoff, Hai Nam Tran, Pierre Dissaux
2014 ACM SIGBED Review  
We also define how the AADL model must be written to express the standard policies for the multi-processor scheduling.  ...  To this end, this paper presents the extension of the scheduling analysis tool Cheddar to deal with multi-processor scheduling.  ...  For scheduling analysis with global scheduling, Cheddar offers less features. For this class of multiprocessor architectures, Cheddar only offers scheduling analysis by scheduling simulation.  ... 
doi:10.1145/2597457.2597467 fatcat:wywkkwprbjhbzixptfvsmh7gpi

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, Wen-mei W. Hwu
2008 Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming - PPoPP '08  
The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth.  ...  The NVIDIA CUDA programming model [3] was created for developing applications for this platform.  ...  We also thank the other members of the IMPACT research group for their support.  ... 
doi:10.1145/1345206.1345220 dblp:conf/ppopp/RyooRBSKH08 fatcat:vyn3h5q52zejxle3x7n6urz4tm

Experiences with OpenMP in tmLQCD [article]

A. Deuzeman, K. Jansen, B. Kostrzewa, C. Urbach
2013 arXiv   pre-print
An overview is given of the lessons learned from the introduction of multi-threading using OpenMP in tmLQCD.  ...  In particular, programming style, performance measurements, cache misses, scaling, thread distribution for hybrid codes, race conditions, the overlapping of communication and computation and the measurement  ...  On the other hand, for the L = 8 measurement, performance without communication is severely degraded, suggesting inefficiencies in multi-threading.  ... 
arXiv:1311.4521v1 fatcat:xnyzl4onendndlee6glkdik2se

GC3: An Optimizing Compiler for GPU Collective Communication [article]

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, Yifan Xiong
2022 arXiv   pre-print
GC3 provides a domain specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly  ...  Custom collective algorithms optimized for both particular network topologies and application specific communication patterns can alleviate this bottleneck and help these applications scale.  ...  DSL Scheduling Directives Optimizing a program's schedule is crucial for extracting performance.  ... 
arXiv:2201.11840v3 fatcat:542qd5tmozb3vphtjplgshjmiu

An Optimized and Efficient Multi Parametric Scheduling Approach for Multi-Core Systems

Sonia Mittal, Priyanka Sharma
2013 Journal of clean energy technologies  
For Efficient scheduling of task or thread on multi core system the operating system scheduler must be aware about the underlying heterogeneity present in the system, also it must be aware about the characteristics  ...  Operating System procedures must cover issues at lower abstraction layers, close to firmware, in order to enable features like optimal task/thread level scheduling depending upon the application requirements  ...  One of the main problems with threads, however, is that their memory access behavior  ... 
doi:10.7763/ijcte.2013.v5.716 fatcat:kcrxdhlpafaylofc25zol7d43y

Optimizing coarse-grain reconfigurable hardware utilization through multiprocessing: an H.264/AVC decoder example

Andreas Kanstein, Sebastian López Suárez, Bjorn De Sutter, Valentín de Armas Sosa, Kamran Eshraghian, Félix B. Tobajas
2007 VLSI Circuits and Systems III  
Coarse-grained reconfigurable architectures offer high execution acceleration for code which has high instruction-level parallelism (ILP), typically for large kernels in DSP applications.  ...  We introduce a multi-processing extension to the coarse-grained reconfigurable architecture ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) to deal with this kind of applications,  ...  For the supercomputing domain, a coarse-grained array architecture which employs static allocation and dynamic scheduling for dynamic multi-threading has been proposed and developed [8].  ... 
doi:10.1117/12.722077 fatcat:kg5dvj7p7zerlmckn5uacazsli

A Single-Path Chip-Multiprocessor System [chapter]

Martin Schoeberl, Peter Puschner, Raimund Kirner
2009 Lecture Notes in Computer Science  
Time-sliced arbitration of the main memory access provides time-predictable memory load and store instructions. Single-path programming avoids control flow dependent timing variations.  ...  Chip-level multi-threading for up to six threads eliminates the need for data forwarding, pipeline stalling, and branch prediction.  ...  Besides its support for predictability, our planning-based approach allows for the following optimizations of the TDMA schedules for global memory accesses.  ... 
doi:10.1007/978-3-642-10265-3_5 fatcat:7kd6i6uuxbcu3j2mgbsrc23kwi

Automatically exploiting cross-invocation parallelism using runtime information

Jialu Huang, T. B. Jablin, S. R. Beard, N. P. Johnson, D. I. August
2013 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
Automatic parallelization is a promising approach to producing scalable multi-threaded programs for multicore architectures.  ...  These techniques use static analysis to partition iterations among threads to avoid cross-thread dependences.  ...  Acknowledgments We thank the entire Liberty Research Group for their support and feedback during this work. We also thank the anonymous reviewers for their insightful comments and suggestions.  ... 
doi:10.1109/cgo.2013.6495001 dblp:conf/cgo/HuangJBJA13 fatcat:ctuqrto5knfppjaascyhexq36u

Customizing Instruction Set Extensible Reconfigurable Processors Using GPUs

Unmesh D. Bordoloi, Bharath Suri, Swaroop Nunna, Samarjit Chakraborty, Petru Eles, Zebo Peng
2012 2012 25th International Conference on VLSI Design  
However, most design automation algorithms for instruction set customization (like enumerating and selecting the optimal set of custom instructions) are computationally intractable.  ...  Experiments conducted on well known benchmarks show significant speedups over sequential CPU implementations as well as over multi-core implementations.  ...  In this paper, we have considered a more general problem formulation by focusing on multi-objective optimization instead of optimizing for a single objective.  ... 
doi:10.1109/vlsid.2012.107 dblp:conf/vlsid/BordoloiSNCEP12 fatcat:yg3wx23zszbh5jjeeigtktyyky

Automatic C-to-CUDA Code Generation for Affine Programs [chapter]

Muthu Manikandan Baskaran, J. Ramanujam, P. Sadayappan
2010 Lecture Notes in Computer Science  
that is optimized for efficient data access.  ...  CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations.  ...  The C-to-CUDA Code Generation Framework warp scheduling through the CUDA runtime scheduler. Any warps whose next instruction has ready operands is eligible for execution.  ... 
doi:10.1007/978-3-642-11970-5_14 fatcat:euk4pngadbcrfdzheclqlslahu

A New Compiler for Space-Time Scheduling of ILP Processors

Rajendra Kumar, P. K. Singh
2011 International Journal of Computer and Electrical Engineering  
Achieving a high level of parallelism at faster clock speeds requires distributing processor resources and avoiding primitives that require single-cycle global communication.  ...  The split line instruction problem depreciates this situation for x86 processors.  ...  Consequently, microprocessors increasingly support coarser thread-based parallelism in the form of simultaneous multi-threading (SMT) [6] and chip multi-processing (CMP) [12].  ... 
doi:10.7763/ijcee.2011.v3.375 fatcat:zkukpxggjvbv5nhacioojp6uma

Fast implementation of DGEMM on Fermi GPU

Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, Ninghui Sun
2011 Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11  
Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves comparable performance with the latest CUBLAS library.  ...  Our optimization strategy is further guided by a performance modeling based on micro-architecture benchmarks.  ...  That is, we only optimized latency hiding for shared memory accesses. • version 4: Based on version 3, we further optimized the latency hiding for global memory access using instruction scheduling optimization  ... 
doi:10.1145/2063384.2063431 dblp:conf/sc/TanLTPBS11 fatcat:4v6sakpdyzg5pnx6lyzb2ogrda