Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding [chapter]

Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara
2004 Lecture Notes in Computer Science  
This paper describes inter-array padding to minimize cache conflict misses among macro-tasks under a data localization scheme, which decomposes loops sharing the same arrays to fit the cache size and executes  ...  In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism.  ...  Acknowledgments This research is supported by the METI/NEDO millennium project IT21 "Advanced Parallelizing Compiler" and STARC (Semiconductor Technology Academic Research Center).  ...
doi:10.1007/978-3-540-24644-2_5 fatcat:27gypcyrobhibbgjn6ddcnvr5y
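The inter-array padding idea in this entry can be illustrated with a minimal C sketch (the array sizes, struct names, and cache parameters below are illustrative assumptions, not taken from the paper): when two arrays are separated by a multiple of the cache size, same-index elements map to the same cache set of a direct-mapped cache and evict each other; a pad of one cache line breaks that alignment.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES (32 * 1024)              /* assumed direct-mapped cache size */
#define CACHE_LINE  64                       /* assumed cache line size in bytes */
#define N (CACHE_BYTES / sizeof(double))     /* each array spans the whole cache */

/* back-to-back layout: b[i] sits exactly CACHE_BYTES after a[i],
 * so both map to the same cache set and conflict on every access */
struct unpadded { double a[N]; double b[N]; };

/* inter-array padding: one unused cache line shifts b's set mapping */
struct padded   { double a[N]; char pad[CACHE_LINE]; double b[N]; };

/* a[i] and b[i] conflict in a direct-mapped cache when their byte
 * distance is congruent to 0 modulo the cache size */
static size_t set_offset(size_t byte_distance) {
    return byte_distance % CACHE_BYTES;
}
```

With the pad in place, successive accesses to `a[i]` and `b[i]` land in different cache sets instead of thrashing one set; a real compiler would pick the pad size from the observed conflict pattern rather than a fixed line.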

Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers [chapter]

Kazuhisa Ishizaka, Takamichi Miyamoto, Jun Shirako, Motoki Obata, Keiji Kimura, Hironori Kasahara
2005 Lecture Notes in Computer Science  
Also, it allows global cache optimization over different loops, or coarse grain tasks, based on a data localization technique with inter-array padding to reduce memory access overhead.  ...  The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block, in addition  ...  Also, the authors thank NEC Soft, Ltd. and SGI Japan, Ltd. for kindly offering the use of the NEC TX7/i6010 and SGI Altix 3700 systems for this research.  ...
doi:10.1007/11532378_23 fatcat:ipm637l2brevhi5ycvdbshkeby

Automatic Coarse Grain Task Parallel Processing on SMP Using OpenMP [chapter]

Hironori Kasahara, Motoki Obata, Kazuhisa Ishizaka
2001 Lecture Notes in Computer Science  
based on the hierarchical coarse grain task parallel processing concept.  ...  This paper proposes a simple and efficient implementation method for a hierarchical coarse grain task parallel processing scheme on an SMP machine.  ...  Implementation of coarse grain task parallel processing using OpenMP This section describes an implementation method of coarse grain task parallel processing using OpenMP for SMP machines.  ...
doi:10.1007/3-540-45574-4_13 fatcat:lqmfuzm7jvgkhomgv5fg7cnzmy
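A minimal sketch of the coarse grain task idea using OpenMP sections (an illustrative example, not the paper's actual code generation scheme; the task bodies and names are invented): independent loops are treated as whole macro-tasks and run concurrently as units, rather than parallelizing the iterations of each loop.

```c
#include <assert.h>

#define N 1000
static double x[N], y[N];

/* two independent loops, each treated as one coarse grain macro-task */
static void macro_task_a(void) { for (int i = 0; i < N; i++) x[i] = i * 0.5; }
static void macro_task_b(void) { for (int i = 0; i < N; i++) y[i] = i * 2.0; }

void run_macro_tasks(void) {
    /* each section is one macro-task; with OpenMP enabled the sections
     * run on different threads, and without OpenMP the pragmas are
     * ignored and the tasks simply run one after the other */
    #pragma omp parallel sections
    {
        #pragma omp section
        macro_task_a();
        #pragma omp section
        macro_task_b();
    }
}
```

Because the two macro-tasks touch disjoint arrays, no synchronization beyond the implicit barrier at the end of the sections region is needed.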

Hierarchical Parallelism Control for Multigrain Parallel Processing [chapter]

Motoki Obata, Jun Shirako, Hiroki Kaminaga, Kazuhisa Ishizaka, Hironori Kasahara
2005 Lecture Notes in Computer Science  
To improve the effective performance and usability of shared memory multiprocessor systems, a multigrain compilation scheme, which hierarchically exploits coarse grain parallelism among loops, subroutines  ...  In order to efficiently use the hierarchical parallelism of each nest level, or layer, in multigrain parallel processing, it is required to determine how many processors or groups of processors should be assigned  ...  and the coarse grain task parallel processing time by the OSCAR compiler using 8 processors are shown for each SPEC95FP program.  ...
doi:10.1007/11596110_3 fatcat:3krnjuzrlbcehni2xviqj5conu

Reducing task creation and termination overhead in explicitly parallel programs

Jisheng Zhao, Jun Shirako, V. Krishna Nandivada, Vivek Sarkar
2010 Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10  
The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism.  ...  However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations  ...  Finally, we would like to thank the anonymous reviewers for their comments and suggestions, and Doug Lea for providing access to the UltraSPARC T2 SMP system used to obtain the performance results reported  ... 
doi:10.1145/1854273.1854298 dblp:conf/IEEEpact/ZhaoSNS10 fatcat:4ipfg5pnwjeuteyao2sj3y7ggu

An Implementation of Multiple-Standard Video Decoder on a Mixed-Grained Reconfigurable Computing Platform

Leibo LIU, Dong WANG, Yingjie CHEN, Min ZHU, Shouyi YIN, Shaojun WEI
2016 IEICE transactions on information and systems  
The proposed RPU, including 16 × 16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in video decoding.  ...  This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units  ...  For instance, Sterpone [6] proposed an analytical model for analyzing the tradeoff between fine-grained processing tasks and coarse-grained tasks that should be implemented on different hardware architectures  ...
doi:10.1587/transinf.2015edp7369 fatcat:4ixd2sywvvfv5izvwwhpzwl5xe

Reconfiguration Process Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

Bo LIU, Peng CAO, Min ZHU, Jun YANG, Leibo LIU, Shaojun WEI, Longxing SHI
2012 IEICE transactions on information and systems  
This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II (REMUS-II).  ...  The optimization methods include two aspects: the multi-target reconfiguration method and the configuration caching strategies.  ...  In contrast to fine-grained RAs, coarse-grained RAs (CGRAs) use word-length function units such as multipliers and arithmetic logic units.  ... 
doi:10.1587/transinf.e95.d.1858 fatcat:md4rzrvnsvf37f5r6duuy2gb64

Combined partitioning and data padding for scheduling multiple loop nests

Zhong Wang, Edwin H.-M. Sha, Xiaobo (Sharon) Hu
2001 Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01  
Data padding is applied in our technique to eliminate cache interference, which overcomes the problem of cache conflict misses arising from loop partitioning.  ...  With the widening performance gap between processors and main memory, efficient memory access behavior is necessary for good program performance.  ...  Inter-variable padding can be used to eliminate the cross-interference between different arrays. The pad size should be selected such that no two arrays conflict in the cache.  ...
doi:10.1145/502225.502228 fatcat:nge25mjqibecfiganrifyi46hq

A Survey on Hardware and Software Support for Thread Level Parallelism [article]

Somnath Mazumdar, Roberto Giorgi
2016 arXiv   pre-print
Hardware support at execution time is crucial to system performance; thus, different types of hardware support for threads also exist or have been proposed, primarily based on widely used  ...  Today's computers are built upon multiple processing cores and run applications consisting of a large number of threads, making runtime thread management a complex process.  ...  In coarse-grain multithreading, switching to another thread only happens when there is a long-latency stall (e.g., a cache miss) [IGHJS95].  ...
arXiv:1603.09274v3 fatcat:75isdvgp5zbhplocook6273sq4

Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER

Miguel Ferreira, Nuno Roma, Luis MS Russo
2014 BMC Bioinformatics  
Results: A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 inter-task parallelization approach similar to the DNA alignment algorithm proposed by Rognes.  ...  One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar's striped processing pattern with the Intel SSE2 instruction set  ...  In contrast to Farrar's method, which was based on the exploitation of intra-task parallelism, Rognes' method also makes use of SSE2 vector processing but exploits an inter-task parallelism scheme (i.e., multiple  ...
doi:10.1186/1471-2105-15-165 pmid:24884826 pmcid:PMC4229909 fatcat:uchrul564fgerlhwji6bgxytyq
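The inter-task scheme attributed to Rognes in this entry can be sketched in plain C (a toy stand-in: the lane loop below corresponds to a single SIMD instruction over an SSE2 register, and the max-accumulate step is illustrative, not HMMER's actual Viterbi recurrence): one cell from each of several independent sequence-comparison tasks occupies one vector lane, and all tasks advance in lockstep.

```c
#include <assert.h>

#define LANES 4   /* e.g., four 32-bit lanes of one 128-bit SSE2 register */

/* one lockstep step across LANES independent tasks: each lane keeps a
 * running best score for its own task (a toy stand-in for one cell
 * update of a dynamic-programming recurrence) */
void step_all_tasks(int best[LANES], const int candidate[LANES]) {
    for (int l = 0; l < LANES; l++)   /* this loop maps to one vector max */
        if (best[l] < candidate[l])
            best[l] = candidate[l];
}
```

The appeal of the inter-task layout is that the lanes never exchange data within a step, so no shuffles or lane corrections are needed, unlike intra-task striping, where dependencies inside one task cross lane boundaries.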

AceMesh: a structured data driven programming language for high performance computing

Li Chen, Shenglin Tang, You Fu, Xiran Gao, Jie Guo, Shangzhi Jiang
2020 CCF Transactions on High Performance Computing  
Its language features include data-centric parallelizing templates and aggregated task dependences for parallel loops.  ...  , and reducing system complexity incurred by complex array sections.  ...  to computing resources, and optimizing inter-node communications.  ...
doi:10.1007/s42514-020-00047-4 fatcat:5d6q663fuffr7fma3kqrmjffl4

Design of a parallel AES for graphics hardware using the CUDA framework

Andrea Di Biagio, Alessandro Barenghi, Giovanni Agosta, Gerardo Pelosi
2009 2009 IEEE International Symposium on Parallel & Distributed Processing  
With respect to previous works, we focus on optimizing the implementation for practical application scenarios, and we provide a throughput improvement of over 14 times.  ...  The encryption activity is computationally intensive, and exposes a significant degree of parallelism.  ...  The parameter space considered for the AES algorithm implementation in the experimental campaign is summed up as follows: • Kind of parallelism: either fine-grained or coarse-grained. • T-box memory allocation  ... 
doi:10.1109/ipdps.2009.5161242 dblp:conf/ipps/BiagioBAP09 fatcat:kxjc46vg45astkmmxxsunpcqge

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures [article]

Varun Nagpal
2016 arXiv   pre-print
The experimental results obtained on an Intel multicore Xeon system show performance speedups (w.r.t. the sequential baseline) of maximum 56x, average 33x, and minimum 8.3x for real-world graph data sets.  ...  In terms of raw performance, for the graph data set called Patents network, our results on Intel Xeon multicore (16 hw threads) were 1.27x faster than previous results on Cray XMT (16 hw threads), while  ...  Also, task interactions are generally insignificant in data-parallel applications, making them more amenable to coarse-grained parallelism.  ...
arXiv:1603.02655v1 fatcat:nklt3op66vdfdpmd3ygeckhwla

Chapter 5. Realistic Computer Models [chapter]

Deepak Ajwani, Henning Meyerhenke
2010 Lecture Notes in Computer Science  
This can be avoided by inserting a pad, i.e., an allocated but unused array of suitable size, between the two conflicting arrays to change the offset of the second array (inter-array padding).  ...  Coarse-grained Parallel Simulation Results The simulations of coarse-grained parallel algorithms shown in this section resemble the PRAM simulation.  ...
doi:10.1007/978-3-642-14866-8_5 fatcat:j326q2ymeffzfmo36nqst7msmq