
Modulo scheduling with cache reuse information [chapter]

Chen Ding, Steve Carr, Phil Sweany
1997 Lecture Notes in Computer Science  
In addition, we outline refinements to our simple reuse model that should allow modulo scheduling with reuse to achieve improved execution performance over the all-cache-miss assumption as well.  ...  Using a simple cache reuse model in our modulo scheduling software pipelining optimization, we achieved a benefit of 10% improved execution performance over assuming all-cache-hits and we used 18% fewer  ...  Section 4 details our experimental evaluation of modulo scheduling with reuse information, Section 5 describes refinements to our simple cache model that will allow further improvement over those shown  ... 
doi:10.1007/bfb0002856 fatcat:pwwoxzpoavhn7pvhzjjztjgd5q

Improving software pipelining with hardware support for self-spatial loads

Steve Carr, Philip Sweany
1999 SIGARCH Computer Architecture News  
Even with reuse information, references with a stride-one access pattern in the cache (called self-spatial loads) have been treated as all cache hits or all cache misses rather than as a single cache miss  ...  Recent work in software pipelining in the presence of uncertain memory latencies has shown that using compiler-generated cache-reuse analysis to determine proper load latencies can improve performance significantly  ...  Memoria first performs scalar replacement [10] for array references and then annotates Fortran code with the reuse information.  ... 
doi:10.1145/309758.309784 fatcat:icxaotwhbnbu7i6lfb65ubr3em

Clustered Modulo Scheduling in a VLIW Architecture with Distributed Cache

F. Jesús Sánchez, Antonio González
2001 Journal of Instruction-Level Parallelism  
A modulo scheduling scheme for this architecture is also proposed.  ...  The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.  ...  The performance figures shown in this section refer to the modulo scheduling of innermost loops with a number of iterations greater than four and with no system call.  ... 
dblp:journals/jilp/SanchezG01 fatcat:mpgrbukrmnearmo6l22zeshrn4

Improving Software Pipelining by Hiding Memory Latency with Combined Loads and Prefetches [chapter]

Michael Bedy, Steve Carr, Soner Önder, Philip Sweany
2001 Interaction between Compilers and Computer Architectures  
Sánchez and González [25] describe a method for scheduling non-blocking loads called Cache Sensitive Modulo Scheduling (CSMS).  ...  In iterative modulo scheduling [23], first a schedule of MinII instructions is attempted.  ... 
doi:10.1007/978-1-4757-3337-2_4 fatcat:u4ori3hcurcz5jbfgwjh22sdy4

A Data Prefetch and Reuse Strategy for Coarse-Grained Reconfigurable Architectures

Wei GE, Zhi QI, Yue DU, Lu MA, Longxing SHI
2013 IEICE transactions on information and systems  
To improve the data utilization efficiency, a dual-bank cache-like data reuse structure is proposed. Furthermore, a heuristic data prefetch is also introduced to decrease the data access latency.  ...  The HDPR strategy provides not only the flexible data access schedule but also the high data throughput needed to realize fast pipelined implementations of various loop kernels.  ...  We acknowledge the contributions of JiXin Zhang, Ren Chen and Wen Wen, all of whom have been associated with the development of the CGRA architecture.  ... 
doi:10.1587/transinf.e96.d.616 fatcat:hm667kne7raephymsfl3ypk23y

Software Data Prefetching for Software Pipelined Loops

Jesús Sánchez, Antonio González
1999 Journal of Parallel and Distributed Computing  
First, it is shown that evaluating software pipelined schedules without considering memory effects can be rather inaccurate due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered).  ...  We have shown that modulo scheduling schemes using cache-hit latency produce many stalls due to dependences with memory instructions.  ... 
doi:10.1006/jpdc.1999.1553 fatcat:472wggwkknantjizdyjkju7m5a

A methodology for speeding up loop kernels by exploiting the software information and the memory architecture

Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis
2015 Computer languages, systems & structures  
This methodology solves four of the major scheduling sub-problems together as one problem and not separately; these are the sub-problems of finding the schedules with the minimum numbers of i) L1 data cache accesses, ii) L2 data cache accesses, iii) main memory data accesses, iv) addressing instructions.  ...  the cache modulo effect.  ... 
doi:10.1016/j.cl.2015.01.003 fatcat:giverm6gvvbtpcnjap77xjo3de

Exploring the design space of an optimized compiler approach for mesh-like coarse-grained reconfigurable architectures

G. Dimitroulakos, M.D. Galanis, C.E. Goutis
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
The processing elements' local register files and the processing elements' interconnection network is exploited for caching memory data values with data reuse opportunities.  ...  A novel mapping algorithm is also proposed that uses a modulo scheduling technique.  ...  The current work achieves better improvements with the proposed modulo scheduling technique.  ... 
doi:10.1109/ipdps.2006.1639349 dblp:conf/ipps/DimitroulakosGG06 fatcat:b5s6j57tvzgtvk5tabvb4shozq

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking [article]

Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza
2018 arXiv   pre-print
to remain up-to-date with the technological advances at a microarchitectural level.  ...  To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based dissection and discovery.  ...  Volta uses one 128-bit word to encode each instruction together with its corresponding control information.  ... 
arXiv:1804.06826v1 fatcat:obbd5jmwebcxxa7gifbvjeecx4

Using profile information to assist advanced compiler optimization and scheduling [chapter]

W. Chen, R. Bringmann, S. Mahlke, S. Anik, T. Kiyohara, N. Warter, D. Lavery, W. -M. Hwu, R. Hank, J. Gyllenhaal
1993 Lecture Notes in Computer Science  
These transformations include global optimization, acyclic global scheduling, and software pipelining.  ...  Profile information identifies these important execution sequences in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence.  ...  Figure 3: Modulo scheduling with modified hierarchical reduction using profile-based optimization, (a) weighted control-flow graph, (b) data dependence graph, (c) modulo schedule of A-B-E-F, (d) kernel schedule  ... 
doi:10.1007/3-540-57502-2_38 fatcat:eu76e3a255alxdl6glixj5fi6m

An integrated and automated memory optimization flow for FPGA behavioral synthesis

Yuxin Wang, Peng Zhang, Xu Cheng, Jason Cong
2012 17th Asia and South Pacific Design Automation Conference  
We develop memory padding to help in the memory partitioning of indices with modulo operations.  ...  In this paper we integrate data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis.  ...  Preobtained scheduling results provide information for access conflict analysis among reuse buffers.  ... 
doi:10.1109/aspdac.2012.6164955 dblp:conf/aspdac/WangZCC12 fatcat:utp3igrbw5dvjd6lwsszjlhgxy

RDGC: A Reuse Distance-Based Approach to GPU Cache Performance Analysis

Mohsen Kiani, Amir Rajabzadeh
2019 Computing and informatics  
Further, reuse distance analysis is extended to model the multi-partition/multi-port parallel caches and employed by RDGC to analyze GPU cache memories.  ...  In the present paper, we propose RDGC, a reuse distance-based performance analysis approach for GPU cache hierarchy.  ...  They provide reuse distance breakdown calculated from the memory access information generated by GPGPU-sim.  ... 
doi:10.31577/cai_2019_2_421 fatcat:xpigvadvfvcvxndaryqxhw5ena

Data locality and load balancing in COOL

Rohit Chandra, Anoop Gupta, John L. Hennessy
1993 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93  
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory.  ...  This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference.  ...  Determining where to schedule a task simply requires two modulo operations.  ... 
doi:10.1145/155332.155358 dblp:conf/ppopp/ChandraGH93 fatcat:sikpbiodazcphaxy745flcrway

Data locality and load balancing in COOL

Rohit Chandra, Anoop Gupta, John L. Hennessy
1993 SIGPLAN notices  
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory.  ...  This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference.  ...  Determining where to schedule a task simply requires two modulo operations.  ... 
doi:10.1145/173284.155358 fatcat:jh4rloeflfb6niuichtpnnu5je

A detailed GPU cache model based on reuse distance theory

Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal
2014 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)  
However, dynamic warp scheduling techniques might rely on details not available to the cache model, such as branch and warp divergence information.  ...  This requires us to embed information about the cache size in the model, making the reuse distance profile no longer cache-size independent.  ...  This is important because the amount of compulsory misses is cache parameter independent.  ... 
doi:10.1109/hpca.2014.6835955 dblp:conf/hpca/NugterenBCB14 fatcat:473kmw2jk5ablnyo4trhcc6pd4