1,075 Hits in 4.9 sec

Efficient Synchronization for Stencil Computations Using Dynamic Task Graphs

Zubair Wadood Bhatti, Roel Wuyts, Pascal Costanza, Davy Preuveneers, Yolande Berbers
2013 Procedia Computer Science  
This paper presents a novel approach for reducing the synchronization overhead of stencil computations by leveraging dynamic task graphs to avoid global barriers and minimize spin-waiting, and exploiting  ...  Executing stencil computations constitutes a significant portion of execution time for many numerical simulations running on high performance computing systems.  ...  Acknowledgements This research is partially funded by the Research Fund KU Leuven, Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).  ... 
doi:10.1016/j.procs.2013.05.416 fatcat:vyn4ouduvvaybn6kxmefqvjicu

DASH: Distributed Data Structures and Parallel Algorithms in a Global Address Space [chapter]

Karl Fürlinger, José Gracia, Andreas Knüpfer, Tobias Fuchs, Denis Hünich, Pascal Jungblut, Roger Kowalewski, Joseph Schuchart
2020 Lecture Notes in Computational Science and Engineering  
This article describes recent developments in the context of DASH concerning the ability to execute tasks with remote dependencies, the exploitation of dynamic hardware locality, smart data structures,  ...  We would also like to thank the German research foundation (DFG) for the funding received through the SPPEXA priority programme and initiators and managers of SPPEXA for their foresight and level-headed  ...  In contrast to that, the dynamic task discovery (DTD) frontend of PaRSEC dynamically discovers the global task-graph, i.e., each process is aware of all nodes and edges in the graph.  ... 
doi:10.1007/978-3-030-47956-5_6 fatcat:44avzbgnkvh73iriqceboti4wu

Scalable Fine-Grained Metric-Based Remeshing Algorithm for Manycore/NUMA Architectures [chapter]

Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau, Nicolas Le-Goff
2017 Lecture Notes in Computer Science  
In this context, we devise a multi-stage algorithm in which a task graph is built for each kernel.  ...  In addition to index ranges precalculation, a dual-step atomic-based synchronization scheme is used for nodal data updates.  ...  We use a fine-grained maximal graph matching heuristic for task extraction in the swapping kernel.  ... 
doi:10.1007/978-3-319-64203-1_43 fatcat:dxht6opnprev5pr5fheofeahye

From DSL to HPC component-based runtime

Julien Bigot, Hélène Coullon, Christian Pérez
2015 Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing - WOLFHPC '15  
Such architectures also usually become more difficult to use efficiently.  ...  To study it, the paper presents a DSL for multi-stencil programs, that is evaluated on a real-case of shallow water equations.  ...  The Minimal Series-Parallel Graph: Γmsp. Once a dependency graph is built, many solutions can be used to build a parallel application, as for example dynamic schedulers [2, 13].  ... 
doi:10.1145/2830018.2830020 dblp:conf/sc/BigotCP15 fatcat:7n4quhtayfebzkzwcrki6ar5ve

Extensibility and Composability of a Multi-Stencil Domain Specific Framework

Hélène Coullon, Julien Bigot, Christian Perez
2017 International journal of parallel programming  
For example, this phenomenon occurs for stencil-based numerical simulations, for which a large number of languages has been proposed without code reuse between them.  ...  The Multi-Stencil Framework (MSF) presented in this paper combines a new DSL to component-based programming models to enhance code reuse and separation of concerns in the specific case of stencils.  ...  A dependency graph exhibits parallel tasks, or on the contrary sequential execution of tasks. Such a dependency graph can directly be given to a dynamic scheduler, or can statically be scheduled.  ... 
doi:10.1007/s10766-017-0539-5 fatcat:jtlruljuyjdpdjyuvhxjjz7ho4

Fine-grained Locality-aware Parallel Scheme for Anisotropic Mesh Adaptation

Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau
2016 Procedia Engineering  
Tasks are structured into bulk-synchronous steps to avoid data races and to aggregate shared-data accesses.  ...  Data dependencies are expressed by a graph for each kernel, and concurrency is extracted through fine-grained graph coloring.  ...  Acknowledgement A special thanks to Nicolas Le-Goff for his assistance through all steps of this work.  ... 
doi:10.1016/j.proeng.2016.11.035 fatcat:zom63bc53bc43mn42dbsnebvuq

Optimizing Communication Scheduling Using Dataflow Semantics

Adrian Soviani, Jaswinder Pal Singh
2009 2009 International Conference on Parallel Processing  
Communication and synchronization are added automatically and optimized for specific architectures, relieving programmers of this task.  ...  These include exposing communication overlap by decreasing task grain, and aggregating communication by replicating data and computation.  ...  In Cilk, tasks are explicitly spawned and synchronized in a recursive fashion, each task accessing global data structures [18]. Load balancing is dynamic, implemented via work stealing.  ... 
doi:10.1109/icpp.2009.66 dblp:conf/icpp/SovianiS09 fatcat:muszx6yydnbovlafdjzpleblwy

Coloured and task-based stencil codes [article]

Benjamin Hazelwood, Tobias Weinzierl
2018 arXiv   pre-print
We evaluate traditional multithreading strategies on both Broadwell and KNL, study the arising assignment of tasks to threads and, from there, derive two efficient ways to parallelise stencil codes on  ...  Simple stencil codes are and remain an important building block in scientific computing. On shared memory nodes, they are traditionally parallelised through colouring or (recursive) tiling.  ...  This approach pays off for very small problem sizes and very sparse dependencies only. Notably for dynamically changing, block-structured grids, this is an important use case.  ... 
arXiv:1810.04033v1 fatcat:2an47dserzcodkeahdwderggxa

Using GPU's to accelerate stencil-based computation kernels for the development of large scale scientific applications on heterogeneous systems

Jian Tao, Marek Blazewicz, Steven R. Brandt
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
We present CaCUDA, a GPGPU kernel abstraction and a parallel programming framework for developing highly efficient large scale scientific applications using stencil computations on hybrid CPU/GPU architectures  ...  CaCUDA is built upon the Cactus computational toolkit, an open source problem solving environment designed for scientists and engineers.  ...  This work was performed using the computational resources of LSU/LONI and was supported by the Center for Computation and Technology at LSU. This work was also supported by  ... 
doi:10.1145/2145816.2145857 dblp:conf/ppopp/TaoBB12 fatcat:btvpqjgi6rakpef2yv5l2rx27y

The Implementation of ASSIST, an Environment for Parallel and Distributed Programming [chapter]

Marco Aldinucci, Sonia Campa, Pierpaolo Ciullo, Massimo Coppola, Silvia Magini, Paolo Pesciullesi, Laura Potiti, Roberto Ravazzolo, Massimo Torquati, Marco Vanneschi, Corrado Zoccolo
2003 Lecture Notes in Computer Science  
We describe the implementation of ASSIST, a programming environment for parallel and distributed programs.  ...  Although some support optimizations are still missing, test results in Fig. 4 (left) show that the SMU support for dynamically computed stencil patterns is almost as efficient as that of static (unchanging  ...  The Task Code graph is also a target for performance modelling and optimization.  ... 
doi:10.1007/978-3-540-45209-6_100 fatcat:nafzltg5o5h7bl3tkmcxzlb6su

The Performance Implication of Task Size for Applications on the HPX Runtime System

Patricia Grubel, Hartmut Kaiser, Jeanine Cook, Adrian Serio
2015 2015 IEEE International Conference on Cluster Computing  
We focus our study using a task-based runtime system, one possible solution towards Exascale computation. Based on task size and scheduler, the overheads associated with task scheduling vary.  ...  Using the performance counter capabilities in HPX, we characterize task scheduling overheads and show metrics to determine optimal task size.  ...  We also thank the anonymous reviewers for their insightful recommendations.  ... 
doi:10.1109/cluster.2015.119 dblp:conf/cluster/GrubelKCS15 fatcat:vjxornmsvvdengb36gfcvh4tri

Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture [article]

Jesmin Jahan Tithi, Fabrizio Petrini, Hongbo Rong, Andrei Valentin, Carl Ebeling
2021 arXiv   pre-print
How to efficiently map a stencil computation to a CGRA is the key to performance.  ...  Therefore, it has been always important to optimize stencil programs for the best performance.  ...  Fig. 5. Data-flow graph for the compute workers. Fig. 8. A 5-point 2D stencil.  ... 
arXiv:2011.05160v2 fatcat:toy66ieltbg57bdnke7d7eomda

The Suzaku Pattern Programming Framework

Barry Wilkinson, Clayton Ferner
2016 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
The focus for developing Suzaku is on teaching parallel programming. This paper covers the main features of Suzaku and describes our experiences using it in parallel programming classes.  ...  Figure caption: Connection graph for the stencil pattern.  ...  SZ_Pattern_init() initializes the connection graph for a standard pattern (all-to-all, pipeline, stencil). The routine compute() in Fig. 12 is executed by each slave.  ... 
doi:10.1109/ipdpsw.2016.107 dblp:conf/ipps/WilkinsonF16 fatcat:6cqhscxbj5ezllz3repcvl2lsq

Enabling OpenMP Task Parallelism on Multi-FPGAs [article]

R. Nepomuceno, R. Sterle, G. Valarini, M. Pereira, H. Yviquel, G. Araujo
2021 arXiv   pre-print
efficiency.  ...  This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture.  ...  The OpenMP runtime manages the task graph, handles data management, creates and synchronizes threads, among other activities.  ... 
arXiv:2103.10573v2 fatcat:qjfa6bwkszfphmdy3mghbozjcy

Reducing the burden of parallel loop schedulers for many‐core processors

Mahwish Arif, Hans Vandierendonck
2021 Concurrency and Computation  
The loops already execute on 48 threads (see Section 5 for details on the platform) using the Intel Cilkplus runtime. The dynamic range of loop duration is very high, ranging from submicrosecond to tens  ...  Compiler support enables efficient reductions for Cilk, without changing the programming interface of Cilk reducers.  ...  In this work, the burden is reduced by simplifying the scheduling algorithm and by using highly efficient synchronization operations.  ... 
doi:10.1002/cpe.6241 fatcat:4rluruunxjb4dehant4kjl354e