5,415 Hits in 4.8 sec

A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems [chapter]

Jesper Larsson Träff, Andreas Ripke, Christian Siebert, Pavan Balaji, Rajeev Thakur, William Gropp
2008 Lecture Notes in Computer Science  
We present and evaluate a new, simple, pipelined algorithm for large, irregular all-gather problems, useful for the implementation of the MPI_Allgatherv collective operation of MPI.  ...  The algorithm can be viewed as an adaptation of a linear ring algorithm for regular allgather problems for single-ported, clustered multiprocessors to the irregular problem.  ...  The pipelined algorithm performs even better on this machine. Concluding Remarks We described a simple, pipelined ring algorithm for large, irregular all-gather problems.  ... 
doi:10.1007/978-3-540-87475-1_16 fatcat:7vyec2gtgbgtnjfmfuylukexou

A Pipelined Algorithm for Large, Irregular All-Gather Problems

Jesper Larsson Träff, Andreas Ripke, Christian Siebert, Pavan Balaji, Rajeev Thakur, William Gropp
2010 The international journal of high performance computing applications  
We describe and evaluate a new, pipelined algorithm for large, irregular all-gather problems.  ...  This paper is a revised version of the conference presentation "A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems" that appeared in  ...  We described a simple, pipelined ring algorithm for large, irregular all-gather problems.  ... 
doi:10.1177/1094342009359013 fatcat:npva3zddoncszmj6f6gahxkrom

Memory System Support for Irregular Applications [chapter]

John Carter, Wilson Hsieh, Mark Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael Parker, Lambert Schaelicke, Leigh Stoller, Terry Tateyama
1998 Lecture Notes in Computer Science  
Original code: for (i=0; i<sz; i++) x += A[i][i]; Impulse code: remap(diagonal, stride, size, ...); for (i=0; i<sz; i++) x += diagonal[i];  ...  The Impulse configurable memory controller will enable significant performance improvements for irregular applications, because it can be configured to optimize memory accesses on an application-by-application  ...  of irregular problems.  ... 
doi:10.1007/3-540-49530-4_2 fatcat:gmlkfqnwwvbqpnimhn56ysryu4

PIUMA: Programmable Integrated Unified Memory Architecture [article]

Sriram Aananthakrishnan, Nesreen K. Ahmed, Vincent Cave, Marcelo Cintra, Yigit Demir, Kristof Du Bois, Stijn Eyerman, Joshua B. Fryman, Ivan Ganev, Wim Heirman, Hans-Christian Hoppe, Jason Howard (+19 others)
2020 arXiv   pre-print
High performance large scale graph analytics is essential to timely analyze relationships in big data sets.  ...  This paper presents the PIUMA architecture, and provides initial performance estimations, projecting that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude.  ...  These observations call for many simple pipelines, with multi-threading to hide memory latency, see Figure 3.  ... 
arXiv:2010.06277v1 fatcat:xhzq7hs2dnhlpcveqy6twjk3ou

GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design [article]

Haoran You, Tong Geng, Yongan Zhang, Ang Li, Yingyan Lin
2022 arXiv   pre-print
To this end, this paper proposes a GCN algorithm and accelerator Co-Design framework dubbed GCoD which can largely alleviate the aforementioned GCN irregularity and boost GCNs' inference efficiency.  ...  Specifically, on the algorithm level, GCoD integrates a split and conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising the  ...  Cheng Wan at Rice University for his help and discussion in the graph reordering algorithm.  ... 
arXiv:2112.11594v2 fatcat:ivnelobzlbgrzgtb4yf5okew3a

Analysis and performance results of computing betweenness centrality on IBM Cyclops64

Guangming Tan, Vugranam C. Sreedhar, Guang R. Gao
2009 Journal of Supercomputing  
This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application - computing betweenness centrality - on a many-core architecture IBM  ...  Comparing with a conventional parallel algorithm, we get 4X-50X improvement in performance and 16X improvement in scalability on a 128-cores IBM Cyclops64 simulator.  ...  Figure 9: A demonstration of the parallel pipelining process for the BFS phase of the BC algorithm.  ... 
doi:10.1007/s11227-009-0339-9 fatcat:rtslfwssvzasleg55erlq6mpoa

Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture [chapter]

Guangming Tan, Vugranam C. Sreedhar, Guang R. Gao
2008 Lecture Notes in Computer Science  
This paper presents a new technique to optimize locality of irregular programs by leveraging parallelism on a massive many-core architecture - IBM Cyclops64 (C64).  ...  The percolation model opens a door for exploiting locality through parallelism, which is an advantage of the future many-core architecture.  ...  Acknowledgment We would like to thank all reviewers and the shepherd for improving this paper. The authors would like to acknowledge Russo Andrew and Ge Gan at CAPSL for their help.  ... 
doi:10.1007/978-3-540-89740-8_23 fatcat:qzkhkm4zpbdqrkpsqtdxlzwxe4

HLS-based High-Throughput and Work-Efficient Synthesizable Graph Processing Template Pipeline*

Hamzeh Ahangari, Muhammet Mustafa Özdal, Özcan Öztürk
2022 ACM Transactions on Embedded Computing Systems  
While a fixed and clock-wise precisely designed deep-pipeline architecture, written in SystemC, is responsible for processing graph vertices, the user implements the intended iterative graph algorithm  ...  Programming such a hybrid system will be a challenge for most of the non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem.  ...  ACKNOWLEDGMENTS We thank Intel HARP group, including Paderborn University in Germany, for providing us generous support on Xeon+FPGA platform, as well as required software.  ... 
doi:10.1145/3529256 fatcat:2q4wwqsd7fb5zngjitbhz73smy

A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators

Andrey Ayupov, Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Steven Burns, Ozcan Ozturk
2018 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems.  ...  Important architectural features that are key for energy efficient execution are implemented in a common template.  ...  We also describe a generic pipeline for a baseline HLS architecture in that section. In Section IV, we briefly describe the proposed architecture for irregular graph applications.  ... 
doi:10.1109/tcad.2017.2706562 fatcat:hfnzfbny2zfnnkydmswbvtpnfy

Executing irregular scientific applications on stream architectures

Mattan Erez, Jung Ho Ahn, Jayanth Gummaraju, Mendel Rosenblum, William J. Dally
2007 Proceedings of the 21st annual international conference on Supercomputing - ICS '07  
We study four representative sub-classes of irregular algorithms, including finite-element and finite-volume methods for modeling physical systems, direct methods for n-body problems, and computations involving  ...  These codes have irregular structures where nodes have a variable number of neighbors, resulting in irregular memory access patterns and irregular control.  ...  However, requiring all data that an inner-loop computation accesses to be gathered ahead of the computation poses a problem for the irregular accesses of unstructured mesh algorithms.  ... 
doi:10.1145/1274971.1274987 dblp:conf/ics/ErezAGRD07 fatcat:c5koegpe2fg6xm2cmwl44vwyze

Progressive Codesign of an Architecture and Compiler Using a Proxy Application

Arpith Jacob, Ravi Nair, Tong Chen, Zehra Sura, Changhoan Kim, Carlo Bertolli, Samuel Antao, Kevin OBrien
2015 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)  
Such co-design is commonly done using hand-tuned codes for simple kernels that typically do not capture the nuances of real-world applications or reveal the complexities of programming a heterogeneous system  ...  Its energy efficiency is derived from a combination of its novel scalar-vector data-flow path combined with its simple control-flow path that required the development of a sophisticated compiler, co-designed  ...  Pipelining always pays off compared to a simple list-scheduling technique, however, there is a 66% performance differential in the best run of this kernel compared to Determinant.  ... 
doi:10.1109/sbac-pad.2015.18 dblp:conf/sbac-pad/JacobNCSKBAO15 fatcat:2tsunsbztzbdhjuxn5zobup3wi


Zhisong Fu, Michael Personick, Bryan Thompson
2014 Proceedings of Workshop on GRAph Data management Experiences and Systems - GRADES'14  
Our experiments show that, for many graph analytics algorithms, an implementation, with our abstraction, is up to two orders of magnitude faster than a parallel CPU implementation and is comparable to  ...  High performance graph analytics are critical for a long list of application domains.  ...  The proposed framework provides a simple and flexible API that makes it easy to implement a wide range of graph algorithms.  ... 
doi:10.1145/2621934.2621936 dblp:conf/sigmod/FuTP14 fatcat:273pk6stsjhizjdh3aq65qlhse

Graph Processing on FPGAs: Taxonomy, Survey, Challenges [article]

Maciej Besta, Dimitri Stanojevic, Johannes De Fine Licht, Tal Ben-Nun, Torsten Hoefler
2019 arXiv   pre-print
The sheer size of such datasets, combined with the irregular nature of graph processing, poses unique challenges for the runtime and the consumed power.  ...  This is reflected by the recent interest in developing various graph algorithms and graph processing frameworks on FPGAs.  ...  There are three stages in each pipeline which perform slightly different for the scatter and the gather phase.  ... 
arXiv:1903.06697v3 fatcat:f5usapd45jgqpf7ynlz4w6e4si

A framework for FPGA acceleration of large graph problems: Graphlet counting case study

Brahim Betkaoui, David B. Thomas, Wayne Luk, Natasa Przulj
2011 2011 International Conference on Field-Programmable Technology  
This speedup includes all software and IO overhead required, and reduces execution time for this common bioinformatics algorithm from about 2 hours to just 12 minutes.  ...  for large graphs.  ...  ACKNOWLEDGMENT The support of Imperial College London Research Excellence Award, the FP7 REFLECT (Rendering FPGAs for Multi-Core Embedded Computing) Project, the UK Engineering and Physical Sciences Research  ... 
doi:10.1109/fpt.2011.6132667 dblp:conf/fpt/BetkaouiTLP11 fatcat:isjetlbo5bebxckk4l42ot47ia

The Graphics Unit of the INTEL 180860 [article]

Ulrich Kursawe
1989 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04  
The Intel 180860 is a very powerful RISC processor, designed for applications that require a large amount of floating point and integer calculations.  ...  of the same algorithm.  ...  When used as a vector processor, the 180860 needs (as all vector processors) special preparation of data, like gathering a large amount of input data before processing them in a program using pipelined  ... 
doi:10.2312/eggh/eggh89/229-247 fatcat:7jg7pnqdnfdvrhiujy35s2hsna