A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems
[chapter]
2008
Lecture Notes in Computer Science
We present and evaluate a new, simple, pipelined algorithm for large, irregular all-gather problems, useful for the implementation of the MPI_Allgatherv collective operation of MPI. ...
The algorithm can be viewed as an adaptation of a linear ring algorithm for regular allgather problems for single-ported, clustered multiprocessors to the irregular problem. ...
The pipelined algorithm performs even better on this machine.
Concluding Remarks We described a simple, pipelined ring algorithm for large, irregular all-gather problems. ...
doi:10.1007/978-3-540-87475-1_16
fatcat:7vyec2gtgbgtnjfmfuylukexou
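The abstract above refers to a linear ring algorithm for all-gather. As a rough illustration only, the following is a minimal single-process simulation of the textbook ring scheme applied to blocks of unequal size. It is not the paper's pipelined algorithm, and the function name ring_allgatherv and the list-of-lists representation are assumptions made for this sketch; no MPI library is used.

```python
def ring_allgatherv(blocks):
    """Simulate a linear ring all-gather over irregular blocks.

    blocks[i] is the (possibly differently sized) list contributed by rank i.
    Returns one entry per rank: the list of all blocks, in rank order.
    This is a sketch of the textbook ring scheme, not the paper's algorithm.
    """
    p = len(blocks)
    # Each rank starts out knowing only its own block.
    result = [[None] * p for _ in range(p)]
    for rank in range(p):
        result[rank][rank] = list(blocks[rank])

    # In round r, rank i forwards the block that originated at rank
    # (i - r) mod p to its right neighbour (i + 1) mod p.
    # After p - 1 rounds every rank holds every block.
    for r in range(p - 1):
        for rank in range(p):
            origin = (rank - r) % p
            right = (rank + 1) % p
            result[right][origin] = list(result[rank][origin])
    return result


if __name__ == "__main__":
    # Four ranks with irregular contributions, including an empty one.
    gathered = ring_allgatherv([[1], [2, 3, 4], [], [5, 6]])
    for rank, view in enumerate(gathered):
        print(rank, view)
```

In this simple form every round moves one whole block per rank, so a round in which some rank forwards a very large block dominates the cost of that round; this is the kind of imbalance that irregular problem sizes introduce.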
A Pipelined Algorithm for Large, Irregular All-Gather Problems
2010
The international journal of high performance computing applications
We describe and evaluate a new, pipelined algorithm for large, irregular all-gather problems. ...
This paper is a revised version of the conference presentation "A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems" that appeared in ...
We described a simple, pipelined ring algorithm for large, irregular all-gather problems. ...
doi:10.1177/1094342009359013
fatcat:npva3zddoncszmj6f6gahxkrom
Memory System Support for Irregular Applications
[chapter]
1998
Lecture Notes in Computer Science
Original code: for (i=0; i<sz; i++) x += A[i][i]; Impulse code: remap(diagonal, stride, size, ...); for (i=0; i<sz; i++) x += diagonal[i]; ...
The Impulse configurable memory controller will enable significant performance improvements for irregular applications, because it can be configured to optimize memory accesses on an application-by-application ...
of irregular problems. ...
doi:10.1007/3-540-49530-4_2
fatcat:gmlkfqnwwvbqpnimhn56ysryu4
PIUMA: Programmable Integrated Unified Memory Architecture
[article]
2020
arXiv
pre-print
High performance large scale graph analytics is essential to timely analyze relationships in big data sets. ...
This paper presents the PIUMA architecture, and provides initial performance estimations, projecting that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude. ...
These observations call for many simple pipelines, with multi-threading to hide memory latency, see Figure 3 . ...
arXiv:2010.06277v1
fatcat:xhzq7hs2dnhlpcveqy6twjk3ou
GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design
[article]
2022
arXiv
pre-print
To this end, this paper proposes a GCN algorithm and accelerator Co-Design framework dubbed GCoD which can largely alleviate the aforementioned GCN irregularity and boost GCNs' inference efficiency. ...
Specifically, on the algorithm level, GCoD integrates a split and conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising the ...
Cheng Wan at Rice University for his help and discussion in the graph reordering algorithm. ...
arXiv:2112.11594v2
fatcat:ivnelobzlbgrzgtb4yf5okew3a
Analysis and performance results of computing betweenness centrality on IBM Cyclops64
2009
Journal of Supercomputing
This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application, computing betweenness centrality, on a many-core architecture IBM ...
Compared with a conventional parallel algorithm, we get 4X-50X improvement in performance and 16X improvement in scalability on a 128-core IBM Cyclops64 simulator. ...
Figure 9 : A demonstration of the parallel pipelining process for the BFS phase of the BC algorithm. ...
doi:10.1007/s11227-009-0339-9
fatcat:rtslfwssvzasleg55erlq6mpoa
Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture
[chapter]
2008
Lecture Notes in Computer Science
This paper presents a new technique to optimize locality of irregular programs by leveraging parallelism on a massive many-core architecture, IBM Cyclops64 (C64). ...
The percolation model opens a door for exploiting locality through parallelism, which is an advantage of the future many-core architecture. ...
Acknowledgment We would like to thank all reviewers and the shepherd for improving this paper. The authors would like to acknowledge Russo Andrew and Ge Gan at CAPSL for their help. ...
doi:10.1007/978-3-540-89740-8_23
fatcat:qzkhkm4zpbdqrkpsqtdxlzwxe4
HLS-based High-Throughput and Work-Efficient Synthesizable Graph Processing Template Pipeline
2022
ACM Transactions on Embedded Computing Systems
While a fixed and clock-wise precisely designed deep-pipeline architecture, written in SystemC, is responsible for processing graph vertices, the user implements the intended iterative graph algorithm ...
Programming such a hybrid system will be a challenge for most of the non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem. ...
ACKNOWLEDGMENTS We thank Intel HARP group, including Paderborn University in Germany, for providing us generous support on Xeon+FPGA platform, as well as required software. ...
doi:10.1145/3529256
fatcat:2q4wwqsd7fb5zngjitbhz73smy
A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators
2018
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems. ...
Important architectural features that are key for energy efficient execution are implemented in a common template. ...
We also describe a generic pipeline for a baseline HLS architecture in that section. In Section IV, we briefly describe the proposed architecture for irregular graph applications. ...
doi:10.1109/tcad.2017.2706562
fatcat:hfnzfbny2zfnnkydmswbvtpnfy
Executing irregular scientific applications on stream architectures
2007
Proceedings of the 21st annual international conference on Supercomputing - ICS '07
We study four representative sub-classes of irregular algorithms, including finite-element and finite-volume methods for modeling physical systems, direct methods for n-body problems, and computations involving ...
These codes have irregular structures where nodes have a variable number of neighbors, resulting in irregular memory access patterns and irregular control. ...
However, requiring all data that an inner-loop computation accesses to be gathered ahead of the computation poses a problem for the irregular accesses of unstructured mesh algorithms. ...
doi:10.1145/1274971.1274987
dblp:conf/ics/ErezAGRD07
fatcat:c5koegpe2fg6xm2cmwl44vwyze
Progressive Codesign of an Architecture and Compiler Using a Proxy Application
2015
2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Such co-design is commonly done using hand-tuned codes for simple kernels that typically do not capture the nuances of real-world applications or reveal the complexities of programming a heterogeneous system ...
Its energy efficiency is derived from a combination of its novel scalar-vector data-flow path combined with its simple control-flow path that required the development of a sophisticated compiler, co-designed ...
Pipelining always pays off compared to a simple list-scheduling technique, however, there is a 66% performance differential in the best run of this kernel compared to Determinant. ...
doi:10.1109/sbac-pad.2015.18
dblp:conf/sbac-pad/JacobNCSKBAO15
fatcat:2tsunsbztzbdhjuxn5zobup3wi
Our experiments show that, for many graph analytics algorithms, an implementation, with our abstraction, is up to two orders of magnitude faster than a parallel CPU implementation and is comparable to ...
High performance graph analytics are critical for a long list of application domains. ...
The proposed framework provides a simple and flexible API that makes it easy to implement a wide range of graph algorithms. ...
doi:10.1145/2621934.2621936
dblp:conf/sigmod/FuTP14
fatcat:273pk6stsjhizjdh3aq65qlhse
Graph Processing on FPGAs: Taxonomy, Survey, Challenges
[article]
2019
arXiv
pre-print
The sheer size of such datasets, combined with the irregular nature of graph processing, poses unique challenges for the runtime and the consumed power. ...
This is reflected by the recent interest in developing various graph algorithms and graph processing frameworks on FPGAs. ...
There are three stages in each pipeline which perform slightly differently for the scatter and the gather phase. ...
arXiv:1903.06697v3
fatcat:f5usapd45jgqpf7ynlz4w6e4si
A framework for FPGA acceleration of large graph problems: Graphlet counting case study
2011
2011 International Conference on Field-Programmable Technology
This speedup includes all software and IO overhead required, and reduces execution time for this common bioinformatics algorithm from about 2 hours to just 12 minutes. ...
for large graphs. ...
ACKNOWLEDGMENT The support of Imperial College London Research Excellence Award, the FP7 REFLECT (Rendering FPGAs for Multi-Core Embedded Computing) Project, the UK Engineering and Physical Sciences Research ...
doi:10.1109/fpt.2011.6132667
dblp:conf/fpt/BetkaouiTLP11
fatcat:isjetlbo5bebxckk4l42ot47ia
The Graphics Unit of the INTEL i80860
[article]
1989
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04
The Intel i80860 is a very powerful RISC processor, designed for applications that require a large amount of floating point and integer calculations. ...
of the same algorithm. ...
When used as a vector processor, the i80860 needs (as all vector processors) special preparation of data, like gathering a large amount of input data before processing them in a program using pipelined ...
doi:10.2312/eggh/eggh89/229-247
fatcat:7jg7pnqdnfdvrhiujy35s2hsna