Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding
[chapter]
2004
Lecture Notes in Computer Science
This paper describes inter-array padding to minimize cache conflict misses among macro-tasks with data localization scheme which decomposes loops sharing the same arrays to fit cache size and executes ...
In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism. ...
Acknowledgments This research is supported by the METI/NEDO millennium project IT21 "Advanced Parallelizing Compiler" and STARC (Semiconductor Technology Academic Research Center). ...
doi:10.1007/978-3-540-24644-2_5
fatcat:27gypcyrobhibbgjn6ddcnvr5y
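The padding idea in this entry can be illustrated with a minimal C sketch. The cache and array sizes below are hypothetical, and C does not strictly guarantee adjacent placement of globals, so this only shows the mechanism, not the paper's pad-selection scheme:

```c
#include <stddef.h>

/* Minimal sketch of inter-array padding, assuming a direct-mapped
 * 32 KB cache and adjacent array placement (illustrative only).
 * Without the pad, a[] and b[] sit exactly one cache size apart, so
 * a[i] and b[i] map to the same set and evict each other on every
 * iteration (conflict misses). */
#define CACHE_SIZE (32 * 1024)
#define N (CACHE_SIZE / sizeof(double))

static double a[N];
static char   pad[64];        /* one cache line; shifts b's mapping */
static double b[N];
static double c[N];

void vadd(void) {
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];   /* a[i] and b[i] now land in different sets */
}
```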
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers
[chapter]
2005
Lecture Notes in Computer Science
Also, it allows us global cache optimization over different loops, or coarse grain tasks, based on data localization technique with interarray padding to reduce memory access overhead. ...
The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition ...
Also, the authors thank NEC Soft, Ltd. and SGI Japan, Ltd. for kindly offering the use of the NEC TX7/i6010 and SGI Altix 3700 systems for this research. ...
doi:10.1007/11532378_23
fatcat:ipm637l2brevhi5ycvdbshkeby
Automatic Coarse Grain Task Parallel Processing on SMP Using OpenMP
[chapter]
2001
Lecture Notes in Computer Science
... based on the hierarchical coarse grain task parallel processing concept. ...
This paper proposes a simple and efficient implementation method for a hierarchical coarse grain task parallel processing scheme on a SMP machine. ...
Implementation of coarse grain task parallel processing using OpenMP This section describes an implementation method for coarse grain task parallel processing using OpenMP on SMP machines. ...
doi:10.1007/3-540-45574-4_13
fatcat:lqmfuzm7jvgkhomgv5fg7cnzmy
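As a rough illustration of the scheme this entry describes, independent macro-tasks can run as OpenMP parallel sections. This is only the flavor of the approach; the paper's actual implementation generates OSCAR's own scheduling code:

```c
#include <stdio.h>

/* Stand-ins for coarse grain macro-tasks (loops, subroutines,
 * basic blocks); bodies are trivial placeholders. */
static void macro_task_A(void) { puts("macro-task A"); }
static void macro_task_B(void) { puts("macro-task B"); }
static void macro_task_C(void) { puts("macro-task C (after A and B)"); }

int main(void) {
    /* Independent macro-tasks execute in parallel as sections. */
    #pragma omp parallel sections
    {
        #pragma omp section
        macro_task_A();
        #pragma omp section
        macro_task_B();
    }
    macro_task_C();   /* ordered after the implicit barrier */
    return 0;
}
```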
Hierarchical Parallelism Control for Multigrain Parallel Processing
[chapter]
2005
Lecture Notes in Computer Science
To improve effective performance and usability of shared memory multiprocessor systems, a multi-grain compilation scheme, which hierarchically exploits coarse grain parallelism among loops, subroutines ...
In order to efficiently use hierarchical parallelism of each nest level, or layer, in multigrain parallel processing, it is required to determine how many processors or groups of processors should be assigned ...
... and the coarse grain task parallel processing time by the OSCAR compiler using 8 processors are shown for each SPEC95FP program. ...
doi:10.1007/11596110_3
fatcat:3krnjuzrlbcehni2xviqj5conu
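A toy C/OpenMP sketch of the layered processor assignment this entry discusses, using nested parallel regions. Group and thread counts here are hypothetical; the paper's point is that the compiler determines them:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_max_active_levels(2);   /* enable nested parallelism */
    /* Outer layer: two coarse grain tasks, one processor group each. */
    #pragma omp parallel num_threads(2)
    {
        int group = omp_get_thread_num();
        /* Inner layer: loop parallelism inside each group,
         * here 4 processors per group (hypothetical sizing). */
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < 8; i++)
            printf("group %d, iteration %d\n", group, i);
    }
    return 0;
}
```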
Reducing task creation and termination overhead in explicitly parallel programs
2010
Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10
The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. ...
However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations ...
Finally, we would like to thank the anonymous reviewers for their comments and suggestions, and Doug Lea for providing access to the UltraSPARC T2 SMP system used to obtain the performance results reported ...
doi:10.1145/1854273.1854298
dblp:conf/IEEEpact/ZhaoSNS10
fatcat:4ipfg5pnwjeuteyao2sj3y7ggu
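The overhead this entry targets can be seen in a small C/OpenMP sketch contrasting one task per iteration with chunked tasks. This is a hand-written stand-in for the paper's compiler transformations, which operate on async-finish style programs rather than OpenMP:

```c
/* Naive version: one task per iteration; task creation/termination
 * overhead dominates when the task body is small. */
void naive(double *x, int n) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        #pragma omp task firstprivate(i)
        x[i] *= 2.0;
    }
}

/* Chunked version: one task per block of iterations, amortizing the
 * per-task overhead across `chunk` elements. */
void chunked(double *x, int n, int chunk) {
    #pragma omp parallel
    #pragma omp single
    for (int lo = 0; lo < n; lo += chunk) {
        #pragma omp task firstprivate(lo)
        for (int i = lo; i < lo + chunk && i < n; i++)
            x[i] *= 2.0;
    }
}
```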
An Implementation of Multiple-Standard Video Decoder on a Mixed-Grained Reconfigurable Computing Platform
2016
IEICE transactions on information and systems
The proposed RPU, including 16 × 16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in video decoding. ...
This paper presents the design of a multiple-standard 1080p high-definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units ...
For instance, Sterpone [6] proposed an analytical model for analyzing the tradeoff between fine-grained processing tasks and coarse-grained tasks that should be implemented on different hardware architectures ...
doi:10.1587/transinf.2015edp7369
fatcat:4ixd2sywvvfv5izvwwhpzwl5xe
Reconfiguration Process Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications
2012
IEICE transactions on information and systems
This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II (REMUS-II). ...
The optimization methods include two aspects: the multi-target reconfiguration method and the configuration caching strategies. ...
In contrast to fine-grained RAs, coarse-grained RAs (CGRAs) use word-length functional units such as multipliers and arithmetic logic units. ...
doi:10.1587/transinf.e95.d.1858
fatcat:md4rzrvnsvf37f5r6duuy2gb64
Combined partitioning and data padding for scheduling multiple loop nests
2001
Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01
Data padding is applied in our technique to eliminate cache interference, which overcomes the problem of cache conflict misses arising from loop partitioning. ...
With the widening performance gap between processors and main memory, efficient memory access behavior is necessary for good program performance. ...
Inter-variable padding can be used to eliminate the cross-interference between different arrays. The pad size should be selected such that no two arrays conflict in the cache. ...
doi:10.1145/502217.502228
dblp:conf/cases/WangSH01
fatcat:33bvcvzgpnbytmwz24p5svcuum
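The pad-size selection mentioned in this entry can be sketched in C for a direct-mapped cache. The constants and the single-gap criterion are simplifications; the paper combines padding with loop partitioning and footprint analysis:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of pad-size selection: choose the smallest pad that separates
 * the cache-set images of two arrays' base addresses by at least
 * `gap_lines` lines, assuming a direct-mapped cache with illustrative
 * geometry. */
enum { CACHE_BYTES = 32 * 1024, LINE_BYTES = 64 };

size_t pad_between(uintptr_t base_a, size_t size_a, size_t gap_lines) {
    uintptr_t base_b = base_a + size_a;   /* b placed right after a */
    size_t sets  = CACHE_BYTES / LINE_BYTES;
    size_t set_a = (base_a / LINE_BYTES) % sets;
    size_t set_b = (base_b / LINE_BYTES) % sets;
    size_t dist  = (set_b + sets - set_a) % sets;
    if (dist >= gap_lines) return 0;              /* already far enough apart */
    return (gap_lines - dist) * LINE_BYTES;       /* pad to open the gap */
}
```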
A Survey on Hardware and Software Support for Thread Level Parallelism
[article]
2016
arXiv
pre-print
Hardware support at execution time is crucial to the performance of the system, thus different types of hardware support for threads also exist or have been proposed, primarily based on widely used ...
Today's computers are built upon multiple processing cores and run applications consisting of a large number of threads, making runtime thread management a complex process. ...
In coarse-grain multithreading, switching to other threads only happens when there is a long-latency stall (e.g., a cache miss) [IGHJS95]. ...
arXiv:1603.09274v3
fatcat:75isdvgp5zbhplocook6273sq4
Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER
2014
BMC Bioinformatics
Results: A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 inter-task parallelization approach similar to the DNA alignment algorithm proposed by Rognes. ...
One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar's striped processing pattern with Intel SSE2 instruction set ...
In contrast to Farrar's method, which was based on the exploitation of intra-task parallelism, Rognes' method also makes use of SSE2 vector processing but exploits an inter-task parallelism scheme (i.e., multiple ...
doi:10.1186/1471-2105-15-165
pmid:24884826
pmcid:PMC4229909
fatcat:uchrul564fgerlhwji6bgxytyq
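The inter-task idea from this entry can be sketched with SSE2 intrinsics. This toy fragment is not HMMER's kernel; it only shows the lane layout: each 16-bit lane carries the running score of a different sequence, so one saturated add plus one max advances eight independent tasks by a step:

```c
#include <emmintrin.h>  /* SSE2 */

/* Toy fragment of inter-task SIMD: each of the eight 16-bit lanes
 * holds the running Viterbi score of a *different* sequence (task),
 * unlike Farrar's intra-task striping of a single sequence. */
__m128i step(__m128i scores, __m128i emit, __m128i trans_stay,
             __m128i trans_move, __m128i prev_state) {
    __m128i stay = _mm_adds_epi16(scores,     trans_stay);
    __m128i move = _mm_adds_epi16(prev_state, trans_move);
    __m128i best = _mm_max_epi16(stay, move);   /* per-lane Viterbi max */
    return _mm_adds_epi16(best, emit);          /* add emission scores */
}
```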
AceMesh: a structured data driven programming language for high performance computing
2020
CCF Transactions on High Performance Computing
Its language features include data-centric parallelizing templates and aggregated task dependences for parallel loops. ...
... and reducing system complexity incurred by complex array sections. ...
... to computing resources, and optimizing inter-node communications. ...
doi:10.1007/s42514-020-00047-4
fatcat:5d6q663fuffr7fma3kqrmjffl4
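AceMesh is its own language; as a rough analogue of the data-driven task dependences it aggregates, OpenMP depend clauses in C express the same producer/consumer ordering:

```c
#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                      /* producer of a */
        #pragma omp task depend(in: a) depend(out: b)
        b = a + 1.0;                  /* consumer of a, producer of b */
        #pragma omp task depend(in: b)
        printf("b = %f\n", b);        /* runs once b is ready */
    }
    return 0;
}
```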
Design of a parallel AES for graphics hardware using the CUDA framework
2009
2009 IEEE International Symposium on Parallel & Distributed Processing
With respect to previous works, we focus on optimizing the implementation for practical application scenarios, and we provide a throughput improvement of over 14 times. ...
The encryption activity is computationally intensive, and exposes a significant degree of parallelism. ...
The parameter space considered for the AES algorithm implementation in the experimental campaign is summed up as follows: kind of parallelism (either fine-grained or coarse-grained); T-box memory allocation ...
doi:10.1109/ipdps.2009.5161242
dblp:conf/ipps/BiagioBAP09
fatcat:kxjc46vg45astkmmxxsunpcqge
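The coarse-grained decomposition this entry mentions maps one independent cipher block to one task (one GPU thread in the paper). A plain C/OpenMP sketch of that granularity; `aes128_encrypt_block` is a hypothetical stub standing in for a real AES implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stub: copies input through; a real implementation
 * would run the ten AES-128 rounds here. */
static void aes128_encrypt_block(const uint8_t in[16], uint8_t out[16],
                                 const uint8_t *round_keys) {
    (void)round_keys;
    for (int k = 0; k < 16; k++) out[k] = in[k];
}

/* Coarse-grained parallelism: in ECB/CTR-style modes every 16-byte
 * block is independent, so one task processes a whole block.
 * Fine-grained versions instead split work *inside* a block. */
void aes_ecb_parallel(const uint8_t *in, uint8_t *out, size_t nblocks,
                      const uint8_t *round_keys) {
    #pragma omp parallel for
    for (long i = 0; i < (long)nblocks; i++)
        aes128_encrypt_block(in + 16 * i, out + 16 * i, round_keys);
}
```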
Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures
[article]
2016
arXiv
pre-print
The experimental results obtained on an Intel multicore Xeon system show performance speedups (w.r.t. the sequential baseline) of maximum 56x, average 33x, and minimum 8.3x for real-world graph data sets. ...
In terms of raw performance, for the graph data set called Patents network, our results on Intel Xeon multicore (16 hw threads) were 1.27x faster than previous results on the Cray XMT (16 hw threads) while ...
Also, the task interactions are generally insignificant in data parallel applications, making them more amenable to coarse grained parallelism. ...
arXiv:1603.02655v1
fatcat:nklt3op66vdfdpmd3ygeckhwla
Chapter 5. Realistic Computer Models
[chapter]
2010
Lecture Notes in Computer Science
This can be avoided by inserting a pad, i.e., an allocated but unused array of suitable size to change the offset of the second array, between the two conflicting arrays (inter-array padding). ...
Coarse-grained Parallel Simulation Results The simulations of coarse-grained parallel algorithms shown in this section resemble the PRAM simulation. ...
doi:10.1007/978-3-642-14866-8_5
fatcat:j326q2ymeffzfmo36nqst7msmq